Abstract


The  central  goal  of  compressed  sensing  is  to  capture  attributes  of  a  signal  using  very 
few  measurements.  The  initial  publications  by  Donoho  and  by  Candes  and  Tao  have 
been  followed  by  applications  to  image  compression,  data  streaming,  medical  signal 
processing, digital communication and many others. The emphasis has been on random
sensing, but the limitations of this framework include performance guarantees,
storage requirements, and computational cost. This thesis will describe two
deterministic alternatives.

The  first  alternative  is  based  on  expander  graphs.  We  first  show  how  expander  graphs 
are  appropriate  for  compressed  sensing  in  terms  of  providing  explicit  and  efficient 
sensing  matrices  as  well  as  simple  and  efficient  recovery  algorithms.  We  show  that 
by  reformulating  signal  reconstruction  as  a  zero-sum  game  we  can  efficiently  recover 
any sparse vector. We provide a saddle-point reformulation of the expander-based
sparse approximation problem, and propose an efficient expander-based sparse
approximation algorithm, called the GAME algorithm. We show that the restricted
isometry property of expander matrices in the $\ell_1$-norm ensures that the GAME
algorithm always recovers a sparse approximation to the optimal solution with an
$\ell_1/\ell_1$ data-domain approximation guarantee.

We also demonstrate resilience to Poisson noise. The Poisson noise model is
appropriate for a variety of applications, including low-light imaging and digital
streaming, where the signal-independent and/or bounded noise models used in the
compressed sensing literature are no longer applicable. We develop a novel sensing
paradigm based on expander graphs and propose a MAP algorithm for recovering sparse or
compressible signals from Poisson observations. We support our results with
experimental demonstrations of reconstructing average packet arrival rates and
instantaneous packet counts at a router in a communication network, where the
arrivals of packets in each flow follow a Poisson process.

The second alternative is based on error-correcting codes. We show that
deterministic sensing matrices based on second-order Reed-Muller codes optimize
average-case performance. We also describe a very simple algorithm, one-step
thresholding, that succeeds in average-case model selection and sparse approximation,
where more sophisticated algorithms, developed in the context of random sensing,
fail completely.

Finally, we provide an algorithmic framework for structured sparse recovery, where
some extra prior knowledge about the sparse vector is also available. Our algorithm,
called Nesterov Iterative Hard-Thresholding (NIHT), uses the gradient information
in the convex data error objective to navigate over the non-convex set of structured
sparse signals. Experiments show that NIHT can empirically outperform
$\ell_1$-minimization and other state-of-the-art convex optimization-based algorithms
in sparse recovery.
Preliminaries


Chapter  1 
Introduction 


An emerging challenge for information and inference systems is to acquire and analyze
the ever-increasing volume of high-dimensional data produced by a vast range of
natural and manmade phenomena. Sampling, streaming, and recording of even the most
primitive data, e.g., in medical imaging and network monitoring, now produce a data
deluge that severely stresses the available analog-to-digital converter, digital
communication bandwidth, and storage resources; hence, the traditional paradigm of
capturing an entire data set only to compress it for subsequent transmission or
storage is becoming infeasible.

Surprisingly, while the ambient data dimension is large in many problems, the
relevant information therein typically resides in a much lower-dimensional space. This
observation has led to several new theoretical and algorithmic developments in
different communities, including theoretical computer science, machine learning,
applied mathematics, and digital signal processing. One such development is called
compressed sensing (CS),1 which exploits sparse representations [95, 54, 20].

The central goal of compressed sensing is to capture attributes of a signal using
very few measurements. In most work to date, this broader objective is exemplified
by the important special case in which the measurement data constitute a vector
$f = \Phi\alpha^* + e_M$, where $\Phi$ is an $M \times N$ matrix called the sensing matrix,
$\alpha^*$ is a vector in $\mathbb{R}^N$ which can be well-approximated by a $k$-sparse vector
(a vector which has at most $k$ non-zero entries), and $e_M$ is additive
measurement noise.

When $\Phi$ satisfies the so-called restricted isometry property (RIP), it preserves the
geometric information of the set of sparse signals [95, 56]. Based on this observation,
we can tractably, stably and provably approximate sparse signals from
$M \geq 2k\log(N/k)$ measurements using convex optimization [54, 55] or greedy
algorithms [237, 199].

The role of random measurement in compressive sensing (see [54] and [95]) can be
viewed as analogous to the role of random coding in Shannon theory. Both provide
worst-case performance guarantees in the context of an adversarial signal/error model.
Although it is known that certain probabilistic processes generate $M \times N$
measurement matrices that satisfy the RIP with high probability, there is no practical
algorithm for verifying whether a given measurement matrix has this property. Storing
the entries of a random sensing matrix and performing matrix-vector multiplications
may also require significant resources.

1 Not to be confused with the canonical acronym of Computer Science.

These drawbacks lead us to consider deterministic constructions that do not suffer
from them. This thesis will describe two deterministic alternatives. The frameworks
presented here provide

•  easily checkable conditions on special types of deterministic sensing matrices
that guarantee successful sparse approximation and model selection;

•  storage efficiency, as the entries of these matrices can be computed on the fly,
and

•  recovery algorithms with lower complexity that exploit the structure of the
sensing matrices.

The  first  framework  is  based  on  expander  graphs.  We  show  that  by  reformulating 
signal  reconstruction  as  a  zero-sum  game  we  can  efficiently  recover  any  sparse  vector. 
We  also  demonstrate  resilience  to  Poisson  noise. 

The second framework is based on algebraic error-correcting codes. We show that
deterministic sensing matrices based on second-order Reed-Muller codes optimize
average-case performance.
Chapter 2
Notation 


2.1  Vector  Properties 

Throughout this thesis, nonnegative reals (respectively, integers) will be denoted by
$\mathbb{R}_+$ (respectively, $\mathbb{Z}_+$). By $[-1, 1]$, we mean the interval between $-1$ and
$1$, whereas $\{-1, 1\}$ is the discrete set with the elements $-1$ and $1$. For every
positive integer $N$, we denote $[N] = \{1, \ldots, N\}$.

We denote vectors by bold small letters $v$, and we denote matrices by bold capital
letters $\Phi$. Given a vector $u \in \mathbb{R}^N$ and a set $S \subseteq [N]$, we will denote by
$u_S$ the vector obtained by setting to zero all coordinates of $u$ that are in $S^c$,
the complement of $S$. Similarly, if $\Phi$ is an $M \times N$ matrix, then $\Phi_S$ denotes
the $M \times |S|$ submatrix of $\Phi$ which is obtained by restricting the columns of
$\Phi$ to the subset $S$. Also, $v_{i:j}$ denotes the vector $v$ restricted to entries
$i, i+1, \ldots, j$, that is, $v_{i:j} = (v_i, v_{i+1}, \ldots, v_j)$.

A vector $v \in \mathbb{R}^N$ is $k$-sparse if it has at most $k$ non-zero entries. The
support of the $k$-sparse vector $v$, denoted by $\mathrm{Supp}(v)$, indicates the positions
of the non-zero elements. The $\ell_0$ pseudo-norm of $v$, also called the Hamming weight
of $v$, is denoted by $\|v\|_0$, and indicates the number of non-zero entries of $v$.
In other words, $v$ is $k$-sparse if and only if $\|v\|_0 \leq k$.

For each positive integer $p$, the $\ell_p$ norm of a vector $v$ is defined as

$$\|v\|_p = \left( \sum_{i=1}^{N} |v_i|^p \right)^{1/p}.$$

Also, throughout this thesis, for every vector $v$ we define

$$\|v\|_{\min} = \min_{i \,:\, v_i \neq 0} |v_i|$$

as the magnitude of the smallest nonzero entry of $v$. Hölder's inequality is an
important inequality widely used for bounding the inner products between arbitrary
vectors.
Theorem 2.1 (Hölder's inequality). Let $u$ and $v$ be arbitrary vectors in $\mathbb{R}^N$,
and let $p$ and $q$ be real numbers greater than 1 such that $\frac{1}{p} + \frac{1}{q} = 1$. Then

$$\langle u, v \rangle \leq \|u\|_p \|v\|_q,$$

and the equality holds if and only if there exist real non-negative numbers $c_1$ and
$c_2$, not both of them zero, such that for every index $i \in \{1, \ldots, N\}$,
$c_1 |u_i|^p = c_2 |v_i|^q$.

Two  key  operators  which  are  widely  used  in  this  thesis  are  the  Hard  Thresholding 
and  Soft  Thresholding  operators: 

Definition 2.2 (Hard Thresholding). Let $H_k(\cdot) : \mathbb{R}^N \to \mathbb{R}^N$ be a function
that sets to zero all but the $k$ largest coordinates in absolute value. More precisely,
for each $v \in \mathbb{R}^N$, let $\pi$ be a permutation of $\{1, \ldots, N\}$ such that
$|v_{\pi(1)}| \geq |v_{\pi(2)}| \geq \cdots \geq |v_{\pi(N)}|$. Then the vector $H_k(v)$ is a
$k$-sparse vector $\alpha$ where $\alpha_{\pi(i)} = v_{\pi(i)}$ for $i \leq k$ and
$\alpha_{\pi(i)} = 0$ for $i \geq k + 1$.

The hard thresholding operator $H_k(\cdot)$ gives the best $k$-sparse approximation of any
vector $v \in \mathbb{R}^N$, that is, for every $\ell_p$ norm,

$$H_k(v) = \arg\min_{v' \,:\, k\text{-sparse}} \|v - v'\|_p. \tag{2.1.1}$$

This best $k$-sparse approximation can be computed efficiently in time $O(N \log N)$ by
first sorting the elements of $v$ by magnitude, and then selecting the $k$ largest
elements. We also define

$$\sigma_k(v) = \|v - H_k(v)\|_1. \tag{2.1.2}$$

In other words, $\sigma_k(v)$ is the best $k$-term approximation error to $v$ in the
$\ell_1$ norm.
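The sort-then-select procedure described above can be sketched in a few lines. This is a minimal illustration rather than code from the thesis: the function names `hard_threshold` and `sigma_k` and the sample vector are assumptions made for the example, and ties between equal magnitudes are broken by position.

```python
def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    # Indices sorted by decreasing magnitude (stable, so ties keep their order).
    order = sorted(range(len(v)), key=lambda i: -abs(v[i]))
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def sigma_k(v, k):
    """l1 error of the best k-term approximation: ||v - H_k(v)||_1."""
    hk = hard_threshold(v, k)
    return sum(abs(a - b) for a, b in zip(v, hk))


v = [0.5, -3.0, 0.1, 2.0, -0.2]
print(hard_threshold(v, 2))  # only the entries -3.0 and 2.0 survive
print(sigma_k(v, 2))         # sum of the magnitudes of the discarded entries
```

The sort dominates the cost, which matches the $O(N \log N)$ running time quoted above.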

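Definition 2.3 (Soft Thresholding). For $\theta \in \mathbb{R}_+$, we define the soft
thresholding function $S(a, \theta)$ as

$$S(a, \theta) = \begin{cases} \theta & \text{if } a > \theta, \\ -\theta & \text{if } a < -\theta, \\ a & \text{otherwise.} \end{cases} \tag{2.1.3}$$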

For a subset $S \subseteq [N]$ we will denote by $I_S$ the vector with components
$\mathbf{1}_{\{i \in S\}}$, $1 \leq i \leq N$. Given a vector $u$, we will denote by $u^+$ the vector
obtained by setting to zero all negative components of $u$: for all $1 \leq i \leq N$,
$u_i^+ = \max\{0, u_i\}$. Given two vectors $u, v \in \mathbb{R}^N$, we will write $u \succeq v$ if
$u_i \geq v_i$ for all $1 \leq i \leq N$. If $u \succeq \xi I_{[N]}$ for some $\xi \in \mathbb{R}$, we will
simply write $u \succeq \xi$. We will write $\succ$ instead of $\succeq$ if the inequalities are
strict for all $i$.
2.2  Matrix  Properties


Let $\Phi$ be an $M \times N$ matrix. We denote the $i$th column of $\Phi$ by $\varphi_i$ and
denote the entry at the $j$th row of the $i$th column of $\Phi$ by $\varphi_{i,j}$. The
null-space of $\Phi$, denoted by $\mathcal{N}_\Phi$, is the set of all $N$-dimensional vectors $v$
with $\Phi v = 0$. We also use $\Phi^\dagger$ to denote the conjugate transpose of $\Phi$, and use
$\Phi^{+}$ to denote the Moore-Penrose pseudoinverse of $\Phi$.

An $N \times N$ matrix $U$ is unitary if and only if its conjugate transpose satisfies
$U^\dagger = U^{-1}$. It is well known that any $M \times N$ matrix $\Phi$ can be decomposed as
$\Phi = U \Sigma V$, where $U$ is an $M \times M$ unitary matrix, $V$ is an $N \times N$ unitary
matrix, and $\Sigma$ is an $M \times N$ diagonal matrix [149]. The elements of the diagonal
of $\Sigma$ are the singular values of $\Phi$.

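For each positive integer $p$, the $\ell_p$ norm of a matrix $\Phi$ is defined as

$$\|\Phi\|_p = \max_{v \neq 0} \frac{\|\Phi v\|_p}{\|v\|_p}. \tag{2.2.1}$$

In particular, $\|\Phi\|_2 = \sigma_{\max}(\Phi)$, where $\sigma_{\max}(\Phi)$ is the maximum singular
value of $\Phi$. We also write

$$\|\Phi\|_\infty = \max_i \max_j |\varphi_{i,j}|$$

for the largest entry magnitude of $\Phi$.

Similarly, for every integer $k$, the restricted $\ell_p$ norm of a matrix $\Phi$ is
defined as

$$\|\Phi\|_{k,p} = \max_{v \,:\, k\text{-sparse}} \frac{\|\Phi v\|_p}{\|v\|_p}. \tag{2.2.2}$$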

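Theorem 2.4. Let $\Phi$ be an $M \times N$ matrix, and let $v$ be a vector in $\mathbb{R}^N$. Then

$$\|\Phi v\|_\infty \leq \|\Phi\|_\infty \|v\|_1.$$

Proof. For every index $j \in [M]$, from the triangle inequality we get

$$|(\Phi v)_j| = \left| \sum_{i=1}^{N} v_i \varphi_{i,j} \right| \leq \sum_{i=1}^{N} |v_i| \, |\varphi_{i,j}| \leq \|\Phi\|_\infty \|v\|_1.$$

Therefore $\|\Phi v\|_\infty = \max_{j \in [M]} |(\Phi v)_j| \leq \|\Phi\|_\infty \|v\|_1$.  □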
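As a quick numerical sanity check of Theorem 2.4, the sketch below compares both sides of the inequality on a random instance. It assumes that $\|\Phi\|_\infty$ is the largest entry magnitude, as defined above; the helper names `matvec` and `check_bound`, the dimensions, and the random seed are illustrative choices, not part of the thesis.

```python
import random


def matvec(Phi, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in Phi]


def check_bound(Phi, v):
    """Return (||Phi v||_inf, ||Phi||_inf * ||v||_1) for the entrywise max norm."""
    lhs = max(abs(x) for x in matvec(Phi, v))
    max_entry = max(abs(x) for row in Phi for x in row)
    rhs = max_entry * sum(abs(x) for x in v)
    return lhs, rhs


random.seed(0)
M, N = 4, 10
Phi = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(M)]
v = [random.gauss(0, 1) for _ in range(N)]
lhs, rhs = check_bound(Phi, v)
print(lhs <= rhs)
```

The bound is tight when every entry of a row has the maximal magnitude and signs aligned with $v$, for example $\Phi = (1 \;\; 1)$ and $v = (1, 1)$.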


An $M \times N$ matrix $\Phi$ with normalized columns is called a dictionary.

Definition 2.5 (Tight Frame). A dictionary $\Phi$ is a tight frame with redundancy
$\frac{N}{M}$ if for every vector $v \in \mathbb{R}^M$, $\|\Phi^\dagger v\|_2^2 = \frac{N}{M} \|v\|_2^2$, where
$\Phi^\dagger$ is the conjugate transpose of $\Phi$. If $\Phi \Phi^\dagger = \frac{N}{M} I_{M \times M}$, then $\Phi$
is a tight frame with redundancy $\frac{N}{M}$.

The following proposition states that tight frames have the lowest spectral norms
among all dictionaries of the same size.

Proposition 2.6. Let $\Phi$ be an $M \times N$ dictionary. Then $\|\Phi\|_2^2 \geq \frac{N}{M}$, and
equality holds if and only if $\Phi$ is a tight frame with redundancy $\frac{N}{M}$.
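Proposition 2.6 can be illustrated on a tiny real example with $M = 2$ and $N = 4$, where the largest eigenvalue of the $2 \times 2$ matrix $\Phi\Phi^\top$ has a closed form. The two dictionaries below and the helper names are assumptions chosen for illustration: the first is an orthonormal basis concatenated with a rotated copy of it (a tight frame), the second repeats a column and is not tight.

```python
import math


def gram_2xN(cols):
    """Entries (a, b, c) of Phi Phi^T = [[a, b], [b, c]] for a real 2 x N matrix."""
    a = sum(x * x for x, _ in cols)
    b = sum(x * y for x, y in cols)
    c = sum(y * y for _, y in cols)
    return a, b, c


def spectral_norm_sq(cols):
    """Largest eigenvalue of Phi Phi^T, which equals ||Phi||_2^2."""
    a, b, c = gram_2xN(cols)
    return (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)


s = 1 / math.sqrt(2)
tight = [(1, 0), (0, 1), (s, s), (s, -s)]      # Phi Phi^T = 2 I, so N/M = 2 is attained
not_tight = [(1, 0), (1, 0), (0, 1), (s, s)]   # unit-norm columns, but not a tight frame

print(spectral_norm_sq(tight), spectral_norm_sq(not_tight))
```

Both matrices have unit-norm columns, so the trace of $\Phi\Phi^\top$ is $N = 4$ in both cases; the tight frame spreads it evenly over the $M = 2$ eigenvalues, while the second dictionary concentrates more than $N/M$ on one of them.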
2.3  Function  Properties


Definition 2.7 (Convex set). A set $S$ is convex if for every pair of points $P$ and $Q$
in $S$, and every $\alpha \in [0, 1]$, the point $R = \alpha P + (1 - \alpha) Q$ is also in $S$.

Definition 2.8 (Convex function). A function $\mathcal{R} : S \to \mathbb{R}$ is convex if $S$ is a
convex set and moreover, for every pair of points $P$ and $Q$ in $S$, and every
$\alpha \in [0, 1]$, we have

$$\mathcal{R}(\alpha P + (1 - \alpha) Q) \leq \alpha \mathcal{R}(P) + (1 - \alpha) \mathcal{R}(Q).$$

Theorem 2.9 (Convex function). A differentiable function $\mathcal{R} : S \to \mathbb{R}$ is convex if
$S$ is a convex set and moreover, for every pair of points $P$ and $Q$ in $S$,

$$\mathcal{R}(P) \geq \mathcal{R}(Q) + \langle P - Q, \nabla \mathcal{R}(Q) \rangle.$$

Definition 2.10. A differentiable function $\mathcal{R} : S \to \mathbb{R}$ is strongly convex with
parameter $\alpha$ if for every pair of points $P$ and $Q$ in $S$,

$$\mathcal{R}(P) - \mathcal{R}(Q) - \langle P - Q, \nabla \mathcal{R}(Q) \rangle \geq \frac{\alpha}{2} \|P - Q\|_2^2.$$

Definition 2.11 (Big-O notation). $f(n) = O(g(n))$ (alternatively, $f(n) \lesssim g(n)$) if
$\exists\, c_0 > 0, n_0 : \forall\, n \geq n_0$, $f(n) \leq c_0\, g(n)$; $f(n) = \Omega(g(n))$ (alternatively,
$f(n) \gtrsim g(n)$) if $g(n) = O(f(n))$; and $f(n) = \Theta(g(n))$ (alternatively,
$f(n) \asymp g(n)$) if $g(n) \lesssim f(n) \lesssim g(n)$.

2.4  Concentration  Inequalities 

In this section, we provide the main concentration inequalities which are used
throughout the thesis.

Theorem 2.12 (Gaussian tail bound). Let $X \sim \mathcal{N}(0, \sigma^2)$ be a zero-mean Gaussian
random variable with variance $\sigma^2$. Then for all $\epsilon > 0$, we have

$$\Pr[|X| \geq \epsilon\sigma] \leq 2 \exp\left( -\frac{\epsilon^2}{2} \right).$$

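Theorem 2.13 ($\ell_\infty$-Norm of the Projection of a Complex Gaussian Vector). Let $\Phi$
be a real- or complex-valued $M \times N$ matrix having unit $\ell_2$-norm columns and let $v$
be an $M \times 1$ vector having entries independently distributed as $\mathcal{CN}(0, \sigma^2)$.
Then for any $\epsilon > 0$, we have

$$\Pr\left( \|\Phi^\dagger v\|_\infty \geq \sigma\epsilon \right) < \frac{4N}{\sqrt{2\pi}} \cdot \frac{\exp(-\epsilon^2/2)}{\epsilon}, \tag{2.4.1}$$

where $\Phi^\dagger$ denotes the conjugate transpose of $\Phi$.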
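Proof. Assume without loss of generality that $\sigma = 1$, since the general case follows
from a simple rescaling argument. Let $\varphi_1, \ldots, \varphi_N \in \mathbb{C}^M$ be the $N$ columns of
$\Phi$ and define

$$z_i = \varphi_i^\dagger v, \qquad i = 1, \ldots, N. \tag{2.4.2}$$

Note that the $z_i$'s are identically (but not independently) distributed as
$z_i \sim \mathcal{CN}(0, 1)$, which follows from the fact that the entries $v_i$ are independent
$\mathcal{CN}(0, 1)$ variables and the columns of $\Phi$ have unit $\ell_2$-norms. The rest of the proof
follows from the facts that

$$\begin{aligned}
\Pr\left( \|\Phi^\dagger v\|_\infty \geq \epsilon \right)
&\overset{(a)}{\leq} N \cdot \Pr\left( |\mathrm{Re}(z_1)|^2 + |\mathrm{Im}(z_1)|^2 \geq \epsilon^2 \right) \\
&\overset{(b)}{\leq} 2N \cdot \Pr\left( |\mathrm{Re}(z_1)| \geq \frac{\epsilon}{\sqrt{2}} \right) \\
&\overset{(c)}{<} \frac{4N}{\sqrt{2\pi}} \cdot \frac{\exp(-\epsilon^2/2)}{\epsilon}.
\end{aligned}$$

Here, (a) follows by taking a union bound over the event $\bigcup_i \{|z_i| \geq \epsilon\}$, (b)
follows from taking a union bound over the event
$\{|\mathrm{Re}(z_1)| \geq \epsilon/\sqrt{2}\} \cup \{|\mathrm{Im}(z_1)| \geq \epsilon/\sqrt{2}\}$ and noting that the real and
imaginary parts of the $z_i$'s are identically distributed as $\mathcal{N}(0, \frac{1}{2})$, and (c)
mainly follows by upper bounding the complementary cumulative distribution
function [167].  □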
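Step (c) can be spelled out with the standard Gaussian tail estimate. The derivation below is a sketch added for illustration; the standard normal variable $g$ and the threshold $t$ are auxiliary symbols introduced here and do not appear in the original proof.

```latex
% For g ~ N(0,1) and t > 0, bound the integrand using x/t > 1 on (t, \infty):
\Pr[g \geq t]
  = \int_t^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx
  < \int_t^\infty \frac{x}{t} \cdot \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx
  = \frac{1}{\sqrt{2\pi}} \cdot \frac{e^{-t^2/2}}{t}.

% Since Re(z_1) ~ N(0, 1/2), the variable \sqrt{2} Re(z_1) is standard normal, so
2N \cdot \Pr\left( |\mathrm{Re}(z_1)| \geq \tfrac{\epsilon}{\sqrt{2}} \right)
  = 2N \cdot \Pr\left( |g| \geq \epsilon \right)
  = 4N \cdot \Pr\left( g \geq \epsilon \right)
  < \frac{4N}{\sqrt{2\pi}} \cdot \frac{e^{-\epsilon^2/2}}{\epsilon}.
```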

Theorem 2.14 ($\chi^2$-concentration [161]). Let $X \sim \chi^2_m$ be a chi-squared random
variable with $m$ degrees of freedom, with mean $m\sigma^2$, and with standard deviation $\sqrt{2m}\,\sigma^2$.
Then  for  all  0  <  e  <  \,  we  have 


Pr  [X  —  ma 2  >  ema2]  <  exp 


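Theorem 2.15 (Azuma's Inequality [12]). Suppose $(Z_0, Z_1, \ldots, Z_k)$ is a
bounded-difference martingale sequence, that is, for each $i$,
$\mathbb{E}[Z_i \mid Z_0, \ldots, Z_{i-1}] = Z_{i-1}$ and $|Z_i - Z_{i-1}| \leq c_i$. Then for all $\epsilon > 0$,

$$\Pr\left[ |Z_k - Z_0| \geq \epsilon \right] \leq 2 \exp\left( -\frac{\epsilon^2}{2 \sum_{i=1}^{k} c_i^2} \right).$$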

In this thesis, we use Azuma's inequality for complex martingale random variables.

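Theorem 2.16 (Complex Azuma's Inequality). Let $(Z_0, Z_1, \ldots, Z_k)$ be a sequence of
complex random variables such that, for each $i$,
$\mathbb{E}[Z_i \mid Z_0, \ldots, Z_{i-1}] = Z_{i-1}$ and $|Z_i - Z_{i-1}| \leq c_i$. Then for all $\epsilon > 0$,

$$\Pr\left[ |Z_k - Z_0| \geq \epsilon \right] \leq 4 \exp\left( -\frac{\epsilon^2}{8 \sum_{i=1}^{k} c_i^2} \right).$$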


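Proof. For each random variable $Z_i$ let $X_i = \mathrm{Re}(Z_i)$ and $Y_i = \mathrm{Im}(Z_i)$, so that
$Z_i = X_i + \mathrm{i}\, Y_i$. Then $\mathbb{E}[X_i \mid Z_0, \ldots, Z_{i-1}] = X_{i-1}$ and
$\mathbb{E}[Y_i \mid Z_0, \ldots, Z_{i-1}] = Y_{i-1}$. Moreover, since the real and imaginary parts of
a complex number are bounded in magnitude by its modulus,
$|X_i - X_{i-1}| \leq |Z_i - Z_{i-1}| \leq c_i$ and $|Y_i - Y_{i-1}| \leq |Z_i - Z_{i-1}| \leq c_i$. Hence,
$(X_0, \ldots, X_k)$ and $(Y_0, \ldots, Y_k)$ form bounded-difference martingale sequences. Now,
since $|Z_k - Z_0| \leq |X_k - X_0| + |Y_k - Y_0|$ by the triangle inequality, applying
Azuma's inequality to each sequence gives

$$\Pr\left[ |Z_k - Z_0| \geq \epsilon \right]
\leq \Pr\left[ |X_k - X_0| \geq \frac{\epsilon}{2} \right] + \Pr\left[ |Y_k - Y_0| \geq \frac{\epsilon}{2} \right]
\leq 4 \exp\left( -\frac{\epsilon^2}{8 \sum_{i=1}^{k} c_i^2} \right).$$

□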


2.5  Group  Theory 

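In this thesis, we will analyze deterministic sensing matrices for which the columns
form a group $\mathcal{G}$ under pointwise multiplication. The multiplicative identity is the
column $\mathbf{1}$ with every entry equal to 1. The following property is fundamental.

Lemma 2.17. If a group $\mathcal{G}$ has at least one element $f$ different from the identity
element, then the group $\mathcal{G}$ satisfies $\sum_{g \in \mathcal{G}} g = 0$.

Proof. Multiplication by $f$ permutes the elements of $\mathcal{G}$, so

$$f \left( \sum_{g \in \mathcal{G}} g \right) = \sum_{g \in \mathcal{G}} (f g) = \sum_{g \in \mathcal{G}} g.$$

Therefore we have

$$(\mathbf{1} - f) \left( \sum_{g \in \mathcal{G}} g \right) = 0,$$

and since $f \neq \mathbf{1}$ we must have $\sum_{g \in \mathcal{G}} g = 0$.  □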
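A small numerical illustration of Lemma 2.17 follows. The group used here is an assumption chosen for the example: the columns $g_a(x) = \omega^{a x}$ with $\omega = e^{2\pi \mathrm{i}/n}$, indexed by $a = 0, \ldots, n-1$. Because the products are pointwise, the cancellation in the last step of the proof acts entry by entry, so the sketch drops the row $x = 0$, where every column has entry 1 and the corresponding entry of the sum would be $n$ rather than 0.

```python
import cmath

n = 8
omega = cmath.exp(2j * cmath.pi / n)


def column(a):
    """Column g_a with entries omega^(a*x) for x = 1, ..., n-1 (row x = 0 is dropped)."""
    return [omega ** (a * x) for x in range(1, n)]


group = [column(a) for a in range(n)]

# Closure: the pointwise product of columns a and b is column (a + b) mod n.
a, b = 3, 6
product = [p * q for p, q in zip(group[a], group[b])]
closure_err = max(abs(p - q) for p, q in zip(product, group[(a + b) % n]))

# Sum of all group elements, computed entry by entry.
col_sum = [sum(g[x] for g in group) for x in range(n - 1)]
sum_err = max(abs(s) for s in col_sum)

print(closure_err < 1e-9, sum_err < 1e-9)
```

Column `group[0]` is the all-ones identity, and every retained row contains an entry different from 1 in some column, which is what the entrywise cancellation needs.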
Chapter 3
An Overview of Compressed Sensing

3.1  What  is  Compressed  Sensing? 

The central goal of compressed sensing is to capture attributes of a signal using
very few measurements [95, 48, 22]. In most work to date, this broader objective is
exemplified by the important special case in which the measurement data constitute
a vector $f = \Phi\alpha^*$, where $\Phi$ is an $M \times N$ matrix called the sensing matrix, and
$\alpha^*$ is a $k$-sparse vector in $\mathbb{R}^N$ (with $k \ll N$) [60, 86].

There  are  three  main  objectives  in  compressive  sensing: 

•  (O1):  Efficient reconstruction of any $k$-sparse vector $\alpha^*$ from the measurement
vector $f = \Phi\alpha^*$.

•  (O2):  Minimizing the number of required measurements for reconstruction
($M \approx k \ll N$).

•  (O3):  Robustness against data-domain and measurement-domain noise.

Based  on  the  above  objectives,  compressed  sensing  can  be  viewed  as  a  process  con¬ 
sisting  of  two  complementary  tasks: 

1.  (T1):  Designing an appropriate $M \times N$ sensing matrix $\Phi$.

2.  (T2):  Designing  an  efficient  reconstruction  algorithm. 

Objective (O1) requires that the reconstruction algorithm recovers $\alpha^*$ from $f$
without knowing its support a priori. A necessary condition for this requirement is
that no two $k$-sparse vectors are mapped to the same low-dimensional vector. Otherwise,
there is no way to distinguish the two vectors from the low-dimensional measurement
vector. This condition imposes a constraint on the number of required measurements.
Proposition 3.1. Let $\Phi$ be an $M \times N$ matrix that does not map any pair of $k$-sparse
vectors into the same low-dimensional measurement vector. Then the rank of any
$M \times 2k$ submatrix of $\Phi$ is $2k$, and therefore $M \geq 2k$.

Proof. Suppose that there exists an $M \times 2k$ submatrix of $\Phi$ with rank less than $2k$.
Let $B$ denote the indices of the columns of this submatrix. Then $\Phi_B$ has a nontrivial
null-space, and therefore there exists a non-zero vector $v$ in the null-space of $\Phi$
which is $2k$-sparse with $\mathrm{Supp}(v) \subseteq B$. Write $v = \alpha - \beta$, where $\alpha$ and $\beta$ are
$k$-sparse vectors with disjoint supports. Now we have

$$\Phi(\alpha) - \Phi(\beta) = \Phi(\alpha - \beta) = \Phi v = 0.$$

In other words, there are two distinct $k$-sparse vectors $\alpha$ and $\beta$ with
$\Phi\alpha = \Phi\beta$. This means that no reconstruction algorithm can distinguish them by just
looking at the measurement vector, which contradicts the assumption on $\Phi$. Now
suppose $M < 2k$. Then the rank of any $M \times 2k$ submatrix of $\Phi$ is at most $M$, which
is strictly smaller than $2k$.  □
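The construction in the proof can be reproduced on a toy instance with too few measurements. The sketch below is illustrative only: it assumes $k = 2$, $M = 3 < 2k$, and a $3 \times 4$ Vandermonde matrix with nodes 1, 2, 3, 4; the helper `solve` is a plain Gauss-Jordan routine over exact fractions written for this example.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        aug[c] = [x / aug[c][c] for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def matvec(Phi, v):
    return [sum(p * x for p, x in zip(row, v)) for row in Phi]


# M = 3 < 2k = 4: a 3 x 4 Vandermonde matrix with nodes 1, 2, 3, 4.
k, M, z = 2, 3, [1, 2, 3, 4]
Phi = [[Fraction(zj) ** m for zj in z] for m in range(M)]

# A null vector v = (x, 1): solve Phi[:, :3] x = -Phi[:, 3].
x = solve([row[:3] for row in Phi], [-row[3] for row in Phi])
v = x + [Fraction(1)]

# Split v = alpha - beta with disjoint supports {0, 1} and {2, 3}.
alpha = [v[0], v[1], 0, 0]
beta = [0, 0, -v[2], -v[3]]

print(matvec(Phi, alpha) == matvec(Phi, beta), alpha != beta)
```

Here `alpha` and `beta` are two different 2-sparse vectors with identical measurements, so no algorithm could tell them apart from $f$ alone.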

Remark 3.2. If every $M \times 2k$ submatrix of $\Phi$ has rank $2k$, then compressed sensing
is information-theoretically possible using the sensing matrix $\Phi$. This means that
for every $k$-sparse vector $\alpha^*$, given $\Phi\alpha^*$, one can recover $\alpha^*$ successfully by
performing an exhaustive search over all $k$-dimensional coordinate subspaces of $\mathbb{R}^N$.

Thus far, we have seen that there is a trade-off between the first two objectives of
compressed sensing, and in order to achieve Objective (O1) we need to have at least
$2k$ measurements. The last objective in compressed sensing is about robustness against
noise. Sparse approximation is a measure of stability of different compressed sensing
methods, and was originally established by Kashin [166] with later improvements by
Gluskin [128, 129].

Definition 3.3 (Sparse Approximation). Let $p$ and $q$ be positive integers. Let $\Phi$
be an $M \times N$ sensing matrix, and let $\mathcal{A}_\Phi$ be a reconstruction algorithm associated
with $\Phi$. Then $\mathcal{A}_\Phi$ provides an $\ell_p/\ell_q$ sparse approximation guarantee if and only if
there exist absolute constants $C_1, C_2$ such that for every $\alpha^* \in \mathbb{R}^N$ and
$e_M \in \mathbb{R}^M$, given $f = \Phi\alpha^* + e_M$, $\mathcal{A}_\Phi$ can successfully recover a $k$-sparse
vector $\hat{\alpha}$ with

$$\|\alpha^* - \hat{\alpha}\|_p \leq \frac{C_1}{k^{1/q - 1/p}} \|\alpha^* - H_k(\alpha^*)\|_q + C_2 \|e_M\|_p.$$

In  the  rest  of  this  section  we  first  focus  on  the  noiseless  compressed  sensing  problem. 
We  will  see  that  noiseless  compressed  sensing  can  be  efficiently  done  by  designing 
appropriate matrices based on Reed-Solomon codes [5], and then using specific
algorithms for recovering sparse vectors [253]. However, robustness against noise is
lacking in that approach.

An  alternative  approach  is  to  start  with  a  generic  sparse  recovery  algorithm,  and  find 
sufficient  conditions  a  sensing  matrix  should  satisfy  in  order  to  guarantee  the  fidelity  of 
the  recovery  algorithm.  Linear  programming  with  matrices  satisfying  the  Restricted
Isometry  Property  is  an  example  of  this  approach  [54,  55].  In  Section  3.3  we  will 
see  that  this  approach  provides  robustness  against  noise.  However,  the  limitations  of 
this  approach  are  large  storage  and  computational  requirements.  Robustness  against 
the  noise  also  imposes  extra  lower-bounds  on  the  number  of  required  measurements 
which  are  discussed  in  detail  in  Section  3.5. 

To overcome these difficulties, in this thesis we introduce alternative deterministic
matrices which are carefully designed to tackle the robust compressed sensing problem.
Our  sensing  matrices  are  equipped  with  custom  reconstruction  algorithms.  These 
algorithms  exploit  the  structure  of  the  sensing  matrix,  and  provide  efficient  storage, 
compression,  and  recovery,  as  well  as  robustness  against  noise. 


3.2  Noiseless  Compressed  Sensing 

In Section 3.1 we introduced three major objectives of compressed sensing. If we
ignore Objective (O3), then we are left with the noiseless compressed sensing
problem. The main tasks in noiseless compressive sensing are

1.  (NT1):  Designing an $M \times N$ sensing matrix $\Phi$ (with $M \sim k \ll N$), such that
the rank of any $M \times 2k$ submatrix of $\Phi$ is $2k$,

2.  (NT2):  Designing an efficient algorithm for solving the combinatorial
minimization problem

$$\begin{array}{ll} \text{minimize} & \|\alpha'\|_0 \\ \text{subject to} & f = \Phi\alpha'. \end{array} \tag{3.2.1}$$

Solving the combinatorial optimization problem of Equation (3.2.1) is in general
NP-hard [197]. However, here we will see an example of a sensing matrix and a
reconstruction algorithm that can efficiently recover any $k$-sparse vector using only
$M = 2k$ measurements. We first show why the two tasks above are sufficient to guarantee
the achievement of Tasks (T1) and (T2) in compressed sensing.

Proposition 3.4. Suppose that $\Phi$ is a sensing matrix satisfying Task (NT1), and
let $\mathcal{A}_\Phi$ be an algorithm which efficiently solves the optimization problem of
Task (NT2). Let $\alpha^*$ be an arbitrary $k$-sparse vector. Then given $f = \Phi\alpha^*$, the
reconstruction algorithm $\mathcal{A}_\Phi$ efficiently recovers $\alpha^*$ uniquely.

Proof. (NT2) guarantees that the algorithm $\mathcal{A}_\Phi$ always finds a $k$-sparse vector
$\hat{\alpha}$ such that $f = \Phi\hat{\alpha}$. On the other hand, (NT1) guarantees that if $\alpha$ and $\beta$
are two $k$-sparse vectors with $\Phi\alpha = \Phi\beta$, then $\alpha = \beta$. Therefore, since both
$\alpha^*$ and $\hat{\alpha}$ are $k$-sparse and $\Phi\alpha^* = f = \Phi\hat{\alpha}$, we must have
$\hat{\alpha} = \alpha^*$.  □
The $\ell_0$-minimization problem of Equation (3.2.1) can be viewed as a channel coding
problem using linear codes defined over the field of real numbers. To see this, let
$\Phi$ be an $M \times N$ sensing matrix with null-space $\mathcal{N}_\Phi$. Let $\alpha^{+}$ be a solution of
$f = \Phi\alpha'$. Then any other solution of $f = \Phi\alpha'$ is an element of
$\alpha^{+} - \mathcal{N}_\Phi = \{\alpha^{+} - v \mid v \in \mathcal{N}_\Phi\}$. Thus the $\ell_0$-minimization problem of
Equation (3.2.1) is equivalent to the problem of finding $v \in \mathcal{N}_\Phi$ which minimizes
$\|\alpha^{+} - v\|_0$.

If one thinks of $\mathcal{N}_\Phi$ as a linear code defined over the field of real numbers, and
of $f$ as the received word, the $\ell_0$-minimization problem is equivalent to finding
the error vector of minimum (Hamming) weight over all the codewords $v \in \mathcal{N}_\Phi$.
Problems of this nature have been widely studied in the language of coding theory
[183]; however, these codes are typically defined over finite fields, whereas here all
arithmetic is done over the field of real numbers.


Inspired by this simple but fundamental connection between noiseless compressed
sensing and the theory of error-correcting codes, several coding-theory-based
constructions have been proposed for solving the noiseless compressed sensing problem
[5, 217, 152, 56]. In particular, based on the theory of algebraic coding/decoding,
Akcakaya and Tarokh [5] construct Vandermonde sensing matrices that generalize
Reed-Solomon codes. Let $M = 2k$, and consider the $M \times N$ sensing matrix

$$\Phi = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
z_1 & z_2 & \cdots & z_N \\
z_1^2 & z_2^2 & \cdots & z_N^2 \\
\vdots & \vdots & \ddots & \vdots \\
z_1^{M-1} & z_2^{M-1} & \cdots & z_N^{M-1}
\end{pmatrix},$$

where $z_1, \ldots, z_N$ are $N$ distinct, non-zero real numbers.


Observe that since $M = 2k$, any $M \times 2k$ submatrix of $\Phi$ is a $2k \times 2k$ Vandermonde
matrix, and therefore has rank $2k$. In the language of algebraic coding theory, the
null-space of $\Phi$ is a maximum distance separable linear code of length $N$, dimension
$N - M$ and minimum distance $M + 1$, and can be viewed as a generalization of the
Reed-Solomon code over the field of the real numbers.


The Vandermonde reconstruction algorithm (the roots of which go back to 1795; see
[87, 253]) uses the same idea as the algebraic algorithm for decoding Reed-Solomon
codes [183, 231]. It uses the input data to construct an error-locator polynomial; the
roots of this polynomial identify the signals appearing in the sparse superposition.
The whole reconstruction process can be done efficiently using only $O(k^2)$ operations.
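The error-locator idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the $O(k^2)$ algorithm referenced above: it assumes the signal has exactly $k$ non-zero entries, works in exact rational arithmetic, solves the small linear systems by plain Gauss-Jordan elimination, and finds the roots simply by evaluating the polynomial at every node $z_j$. The function names and the test signal are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Solve the square system A x = b exactly (raises StopIteration if A is singular)."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        aug[c] = [x / aug[c][c] for x in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [row[n] for row in aug]


def vandermonde_measure(z, alpha, M):
    """Measurements f_m = sum_j alpha_j * z_j^m for m = 0, ..., M-1."""
    return [sum(a * Fraction(zj) ** m for zj, a in zip(z, alpha)) for m in range(M)]


def vandermonde_recover(z, f, k):
    """Recover a vector with exactly k non-zero entries from M = 2k measurements."""
    # Step 1: coefficients c_0..c_{k-1} of the monic error-locator polynomial,
    # obtained from the Hankel system sum_i c_i f_{l+i} = -f_{l+k}, l = 0..k-1.
    H = [[f[l + i] for i in range(k)] for l in range(k)]
    c = solve(H, [-f[l + k] for l in range(k)]) + [Fraction(1)]
    # Step 2: the support consists of the nodes z_j that are roots of the polynomial.
    support = [j for j, zj in enumerate(z)
               if sum(ci * Fraction(zj) ** i for i, ci in enumerate(c)) == 0]
    # Step 3: the amplitudes solve a k x k Vandermonde system.
    V = [[Fraction(z[j]) ** m for j in support] for m in range(k)]
    amps = solve(V, f[:k])
    alpha = [Fraction(0)] * len(z)
    for j, a in zip(support, amps):
        alpha[j] = a
    return alpha


z = [1, 2, 3, 4, 5, 6]
alpha_true = [0, 0, 5, 0, -2, 0]
k = 2
f = vandermonde_measure(z, alpha_true, 2 * k)
alpha_hat = vandermonde_recover(z, f, k)
print(alpha_hat == alpha_true)
```

Exact fractions hide the conditioning problem: in floating point with noisy measurements, the roots of the locator polynomial can move substantially.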

The Vandermonde construction provides optimality in the number of required
measurements ($M = 2k$), and efficiency in sparse reconstruction ($O(k^2)$ operations).
Nevertheless, because the correspondence between the coefficients of a polynomial and
its roots is not well conditioned, it is very difficult to make the algorithm robust
against noise. This difficulty becomes clearer in Section 3.5, in which we will see
that in the robust compressed sensing framework at least $\Omega\left(k \log \frac{N}{k}\right)$
measurements are necessary.


3.3  Robust  Compressed  Sensing 

3.3.1  $\ell_1$-minimization and Restricted Isometry Property

The  coding  theory  approach  to  compressed  sensing  exploits  the  similarities  between 
the $\ell_0$-minimization of Equation (3.2.1) and the decoding step in channel coding.
The  first  step  in  this  approach  is  to  design  a  proper  sensing  matrix  and  a  recovery 
algorithm  specific  to  the  designed  sensing  matrix.  However,  as  we  saw  earlier,  there 
are  fundamental  challenges  against  making  this  approach  robust. 

The sparse approximation problem has also been extensively investigated in the
statistics and machine learning communities [142]. In statistics and machine learning,
we are provided with $M$ training examples. Each training example consists of $N$
distinct features, and the goal is to find a sparse combination of the features that
best represents the labels of all training examples.

More precisely, let $\Phi$ be the $M \times N$ matrix whose rows correspond to the $M$ training
examples, and in each row, the columns represent the values of $N$ different features
for that training example. Also let $f$ be an $M$-dimensional vector in $\mathbb{R}^M$ that
corresponds to the (real-valued) labels for the $M$ training examples. The goal is to
find a sparse vector $\alpha^* \in \mathbb{R}^N$ such that $\Phi\alpha^*$ closely approximates $f$.

For simplicity, first consider the noiseless case in which there exists a $k$-sparse vector
$\alpha^*$ with $f = \Phi\alpha^*$. In this case, the sparse feature selection problem
reduces to the $\ell_0$ minimization problem of Equation (3.2.1):

    minimize    $\|\alpha'\|_0$
    subject to  $f = \Phi\alpha'$.

As the $\ell_0$ pseudo-norm is non-convex, solving this optimization problem in general is
NP-hard [197]. The $\ell_1$ minimization is an alternative and tractable approach:
instead of solving the non-convex $\ell_0$ minimization problem, we solve the convex
$\ell_1$-minimization problem (also known as the Basis Pursuit (BP) problem [73]):

    minimize    $\|\alpha'\|_1$                                   (3.3.1)
    subject to  $f = \Phi\alpha'$.

The $\ell_1$ norm is the convex norm that is most similar to the non-convex pseudo-norm
$\ell_0$ [118]. The following 2-dimensional example provides insight into why $\ell_1$
minimization is better able to select a sparse solution of $f = \Phi\alpha'$ than is
$\ell_2$ minimization. Here our feasible set $S = \{\alpha' : f = \Phi\alpha'\}$ is a line
in the plane, and the analytical solution to the $\ell_2$-minimization problem

    minimize    $\|\alpha'\|_2$                                   (3.3.2)
    subject to  $f = \Phi\alpha'$,

(which is $\hat{\alpha} = \Phi^{\dagger} f$, with $\Phi^{\dagger}$ the pseudo-inverse of
$\Phi$) is the point on this line that is closest in Euclidean distance to the origin.
This point can be found by blowing up a circle (the $\ell_2$ ball) until it touches the
line $f = \Phi\alpha'$. However, as indicated in Figure 3.1(a), this closest point will
generally lie away from the coordinate axes, and hence will not be sparse. In contrast,
the $\ell_1$ ball in Figure 3.1(b) has its corners on the coordinate axes. Therefore,
when the $\ell_1$ ball is blown up, it will first touch the line $f = \Phi\alpha'$ at a
point on one of the coordinate axes, that is, it finds a sparse solution.

[Figure 3.1 panels: (a) Visualization of the $\ell_2$ minimization (3.3.2).
(b) Visualization of the $\ell_1$ minimization (3.3.1).]

Figure 3.1: An illustrative example indicating the advantage of $\ell_1$ minimization
over $\ell_2$ minimization in finding a sparse point on the line $f = \Phi\alpha'$. The
$\ell_2$-minimization (3.3.2) finds a non-sparse point which is the intersection of the
$\ell_2$ ball and the line $f = \Phi\alpha'$, whereas the $\ell_1$-minimization (3.3.1)
finds a sparse point which is the intersection of the $\ell_1$ diamond and the line
$f = \Phi\alpha'$.
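The 2-dimensional picture can be checked numerically. The following is a minimal sketch, assuming a single measurement so that the feasible set is the line `phi[0]*x + phi[1]*y = f`; the function names and the numbers are illustrative and do not come from the thesis. For one linear constraint both minimizers have closed forms, so no solver is needed.

```python
def min_l2_on_line(phi, f):
    """Minimum-Euclidean-norm point on the line phi[0]*x + phi[1]*y = f."""
    scale = f / (phi[0] ** 2 + phi[1] ** 2)
    return [phi[0] * scale, phi[1] * scale]


def min_l1_on_line(phi, f):
    """A minimum-l1-norm point on the same line.

    All the weight goes on the coordinate whose coefficient has the largest
    magnitude, so the result lies on a coordinate axis (a corner of the
    l1 diamond).
    """
    j = 0 if abs(phi[0]) >= abs(phi[1]) else 1
    alpha = [0.0, 0.0]
    alpha[j] = f / phi[j]
    return alpha


phi, f = [1.0, 2.0], 2.0
print(min_l2_on_line(phi, f))  # both coordinates are non-zero
print(min_l1_on_line(phi, f))  # one coordinate is exactly zero
```

Both points satisfy the constraint, but only the $\ell_1$ minimizer is sparse, which is the behavior sketched in Figure 3.1.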

Thus far we have assumed that the vector $f$ can be exactly represented by a sparse
linear combination of the columns of $\Phi$. However, in reality, and in many statistics
and machine learning applications, $f$ can only be well approximated by a sparse linear
combination of the columns of $\Phi$. Over time, the $\ell_1$-minimization approach has been
generalized to address this more general problem.

Examples of these generalizations are the Least Absolute Shrinkage and Selection
Operator (LASSO) [233, 106], which solves the $\ell_1$-regularized regression problem

    minimize_{\alpha'}  $\|f - \Phi\alpha'\|_2^2 + \lambda\|\alpha'\|_1$,        (3.3.3)

for a proper regularization parameter $\lambda$; the Basis Pursuit Denoising program [244,
243], which solves the second order cone program

    minimize    $\|\alpha'\|_1$                                   (3.3.4)
    subject to  $\|f - \Phi\alpha'\|_2 \le \epsilon_1$,

for some appropriately chosen $\epsilon_1$; and the Dantzig Selector program [58], which
solves the linear optimization

    minimize    $\|\alpha'\|_1$                                   (3.3.5)
    subject to  $\|\Phi^\top (f - \Phi\alpha')\|_\infty \le \epsilon_2$,

for another suitably chosen parameter $\epsilon_2$.

After the formalization of $\ell_1$-minimization, several algorithms were proposed to
efficiently solve the above optimization programs. Examples of such algorithms are
interior point methods [51, 52], the Lasso modification to LARS [106, 171], homotopy
methods [99], weighted least squares [163], and gradient-based methods [111, 257, 27,
242].
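To make the gradient-based family concrete, here is a minimal sketch of one generic member, iterative soft thresholding (often called ISTA), applied to the LASSO objective $\|f - \Phi\alpha'\|_2^2 + \lambda\|\alpha'\|_1$ of Equation (3.3.3). It is not claimed to be any of the specific methods cited above; the function names, the fixed step size, and the toy matrix are assumptions made for illustration. The step size should not exceed $1/(2\|\Phi\|_2^2)$.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def soft_threshold(v, tau):
    """Shrink every entry of v toward zero by tau."""
    return [max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def ista(phi, f, lam, step, iterations):
    """Approximately minimize ||f - phi a||_2^2 + lam * ||a||_1."""
    phi_t = [list(col) for col in zip(*phi)]
    alpha = [0.0] * len(phi[0])
    for _ in range(iterations):
        residual = [p - y for p, y in zip(matvec(phi, alpha), f)]
        grad = [2.0 * g for g in matvec(phi_t, residual)]  # gradient of the quadratic term
        moved = [a - step * g for a, g in zip(alpha, grad)]
        alpha = soft_threshold(moved, step * lam)          # proximal step for the l1 term
    return alpha


# Toy matrix (an assumption): half of the 4 x 4 Hadamard matrix, whose columns are orthonormal.
H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
phi = [[0.5 * h for h in row] for row in H]
f = matvec(phi, [0.0, 3.0, 0.0, -1.0])
print(ista(phi, f, lam=1.0, step=0.5, iterations=5))
```

For a matrix with orthonormal columns the LASSO minimizer is the soft-thresholded vector $\Phi^\top f$ with threshold $\lambda/2$, which gives a simple way to sanity-check the iteration on the toy input.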

Experience with many applications has confirmed that $\ell_1$-minimization algorithms
and their extensions [59, 168] can robustly find sparse features that closely
approximate the target vector $f$. Therefore, $\ell_1$ minimization appears to be a suitable
algorithm for Task (T2) of compressed sensing. A natural question that now comes to
mind is “what are proper sensing matrices for which $\ell_1$ minimization is guaranteed
to recover a sparse vector, and among these sensing matrices, for which matrices is
uniqueness in sparse recovery guaranteed?”

To answer this question, recall that Proposition 3.1 implies that if $\Phi$ is a sensing
matrix for which $\ell_1$-minimization is able to recover any $k$-sparse vector $\alpha^*$
from $f = \Phi\alpha^*$, then no nonzero $2k$-sparse vector can be in the null space of
$\Phi$. The following property is a stricter requirement for sensing matrices, and was
introduced by Candes and Tao [56]:

Definition 3.5 (Restricted Isometry Property). Let $\Phi$ be an $M \times N$ sensing matrix.
Then for every integer $k$ and $0 < \epsilon < 1$, $\Phi$ satisfies the $(k, \epsilon)$
Restricted Isometry Property (abbreviated: $\Phi$ is $(k, \epsilon)$-RIP) if for every
$k$-sparse vector $\alpha$ we have

    $(1 - \epsilon)\|\alpha\|_2 \le \|\Phi\alpha\|_2 \le (1 + \epsilon)\|\alpha\|_2$.

Note that any sensing matrix which is $(2k, \epsilon)$-RIP for some $0 < \epsilon < 1$ also
satisfies the requirement of Proposition 3.1. To see this, suppose $\alpha$ is a $2k$-sparse
vector in the null space of $\Phi$. Then from the RIP we have

    $(1 - \epsilon)\|\alpha\|_2 \le \|\Phi\alpha\|_2 = 0$,

and therefore we must have $\|\alpha\|_2 = 0$. In other words, the $(2k, \epsilon)$-RIP
guarantees that no two distinct $k$-sparse vectors can be mapped to the same
low-dimensional vector. That is, compressed sensing is information-theoretically possible
using any $(2k, \epsilon)$-RIP sensing matrix.
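For very small sparsity levels the RIP constant of Definition 3.5 can be computed exactly, which gives a quick check of the null-space argument above. The sketch below assumes sparsity level 2 (that is, the $2k$ of the text with $k = 1$) and at least two columns; the helper name `rip_constant_k2` is illustrative. It uses the fact that the extreme singular values of a two-column submatrix are the square roots of the eigenvalues of its $2 \times 2$ Gram matrix.

```python
import math
from itertools import combinations


def rip_constant_k2(phi):
    """Smallest eps with (1-eps)||a|| <= ||phi a|| <= (1+eps)||a|| for all 2-sparse a."""
    cols = list(zip(*phi))
    eps = 0.0
    for u, v in combinations(cols, 2):
        a = sum(x * x for x in u)              # Gram matrix [[a, b], [b, c]] of the column pair
        c = sum(y * y for y in v)
        b = sum(x * y for x, y in zip(u, v))
        mean = (a + c) / 2.0
        radius = math.hypot((a - c) / 2.0, b)
        s_max = math.sqrt(mean + radius)             # largest singular value of the pair
        s_min = math.sqrt(max(mean - radius, 0.0))   # smallest singular value of the pair
        eps = max(eps, s_max - 1.0, 1.0 - s_min)
    return eps


print(rip_constant_k2([[1.0, 0.0], [0.0, 1.0]]))  # orthonormal columns
print(rip_constant_k2([[1.0, 1.0], [0.0, 0.0]]))  # two identical columns
```

In the second matrix the difference of the two unit coordinate vectors is a nonzero 2-sparse vector in the null space, so the constant reaches 1 and the matrix is not $(2, \epsilon)$-RIP for any $\epsilon < 1$, matching the argument in the text.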

The following celebrated results of Candes, Romberg and Tao [54, 55], and Donoho
et al. [95, 97] also guarantee that if a sensing matrix is $(2k, \epsilon)$-RIP for
sufficiently small $\epsilon$, then $\ell_1$-minimization can exactly recover any $k$-sparse
vector in the noiseless compressed sensing framework, and moreover, the Basis Pursuit
Denoising algorithm can stably approximate any $k$-sparse vector in the presence of
data-domain and measurement-domain noise. This means that compressed sensing is also
computationally possible using RIP sensing matrices and the generic $\ell_1$-minimization
algorithm.

Theorem 3.6 (Noiseless Compressed Sensing [57, 95]). Let $\Phi$ be a sensing matrix
satisfying $(2k, 0.41)$-RIP. For every $k$-sparse vector $\alpha^*$, let $f = \Phi\alpha^*$,
and let $\hat{\alpha}$ be the solution of the $\ell_1$-minimization problem

    minimize    $\|\alpha'\|_1$
    subject to  $f = \Phi\alpha'$.

Then $\hat{\alpha} = \alpha^*$.


Theorem 3.7 (Stable Compressed Sensing [55, 97]). Let $\epsilon$ be a positive number
smaller than 0.3, and let $\Phi$ be a sensing matrix satisfying $(2k, \epsilon)$-RIP. Let
$\alpha^*$ be an arbitrary vector in $\mathbb{R}^N$, and let $H_k(\alpha^*)$ denote the best
$k$-term approximation of $\alpha^*$ defined by Equation (2.1.1). Finally let $e_M$ be an
arbitrary noise vector in $\mathbb{R}^M$, and let $f = \Phi\alpha^* + e_M$. Then the
solution $\hat{\alpha}$ of the Basis Pursuit Denoising problem

    minimize    $\|\alpha'\|_1$
    subject to  $\|f - \Phi\alpha'\|_2 \le \|e_M\|_2$,

satisfies the following sparse approximation guarantee:

    $\|\hat{\alpha} - \alpha^*\|_2 \le c_1 \frac{\|\alpha^* - H_k(\alpha^*)\|_1}{\sqrt{k}} + c_2 \|e_M\|_2$,        (3.3.6)

with constants $c_1$ and $c_2$ that depend only on $\epsilon$.
Table 3.1: Summary of the $\ell_1$-minimization problems used in RIP-based compressed
sensing. In the deterministic noise model no assumption is made regarding the noise,
whereas in the stochastic noise model the noise vector is assumed to be white Gaussian.

Optimization                           Objective                                                                               Noise Model
-------------------------------------  --------------------------------------------------------------------------------------  -------------------
Basis Pursuit (BP) [73]                minimize $\|\alpha'\|_1$  s.t. $\Phi\alpha' = f$                                         No noise
Basis Pursuit Denoising (BPDN) [244]   minimize $\|\alpha'\|_1$  s.t. $\|f - \Phi\alpha'\|_2 \le \epsilon$                      Deterministic noise
LASSO [106]                            minimize $\|\Phi\alpha' - f\|_2^2 + \lambda\|\alpha'\|_1$                                Stochastic noise
Dantzig Selector (DS) [58]             minimize $\|\alpha'\|_1$  s.t. $\|\Phi^\top(f - \Phi\alpha')\|_\infty \le \epsilon$      Stochastic noise

Theorems 3.6 and 3.7 are fundamental as they provide sufficient conditions that a
sensing matrix should satisfy to provably guarantee that tractable $\ell_1$-minimization can
uniquely recover any sparse vector in the noiseless sensing regime [57, 49, 95] and can
robustly find a sparse approximation to any vector in the presence of noise [54, 55, 97].
Table 3.1 summarizes the main $\ell_1$-minimization problems that are widely used in
compressed sensing applications.

Remark 3.8. There is a fundamental difference between the coding theory approach
and the statistics approach. In the coding theory approach, we first design a sensing
matrix (e.g., the Reed-Solomon matrix) and then come up with a recovery algorithm
specific to that particular sensing matrix (e.g., algebraic decoding). In contrast,
in the statistics approach we start with the generic $\ell_1$-minimization algorithm, and
then find specific properties (e.g., RIP) that a sensing matrix should satisfy, so that the
fidelity of the $\ell_1$-minimization is guaranteed. Examples of RIP sensing matrices are
introduced in Section 3.4.


3.3.2  Greedy  and  Iterative  Algorithms 

So far we have seen that if a sensing matrix satisfies the $(2k, \epsilon)$-RIP with
sufficiently small $\epsilon$, then $\ell_1$-minimization methods can stably recover any
sparse vector. However, the best known running time for $\ell_1$-minimization algorithms is
$O(N^{1.5}M^2)$ [205], which is infeasible for many practical applications. These include
important examples such as medical imaging or data streaming, where the number of pixels in
the images or the traffic table sizes ($N$) can be as large as $10^9$. In these
applications, more efficient and scalable algorithms are needed.

Greedy algorithms [237] provide an alternative to the $\ell_1$-minimization approach. They
aim to directly solve the original $\ell_0$-minimization problem. Like the
$\ell_1$-minimization techniques, greedy algorithms were also developed before the
formulation of RIP [85]. The greedy algorithms were initially developed as heuristic
algorithms for approximately solving the non-convex optimization problem

    minimize    $\|f - \Phi\alpha'\|_2^2$                          (3.3.7)
    subject to  $\alpha' \in \Sigma_k$,

where $f$ is an arbitrary vector in $\mathbb{R}^M$, $\Phi$ is an $M \times N$ sensing
matrix, and

    $\Sigma_k = \{\alpha' : \|\alpha'\|_0 \le k\}$

is the union of all $\binom{N}{k}$ $k$-dimensional coordinate subspaces of $\mathbb{R}^N$.
Since $\Sigma_k$ is non-convex, the optimization problem of Equation (3.3.7) is not convex.
Moreover, it is possible to show that (3.3.7) is in general NP-hard, that is, when there is
no restriction on the sensing matrix $\Phi$ [197].

Here we start with the Iterative Hard Thresholding (IHT) algorithm [35, 114], which
is the simplest algorithm for solving the optimization problem of Equation (3.3.7).
IHT can be viewed as a special case of the more generic Gradient Projection algorithm
[105, 130], which is widely used in machine learning [37] and optimization [179].

Let $\mathcal{L} : \mathbb{R}^N \to \mathbb{R}$ be a differentiable loss function, and let
$\Omega$ be a subset of $\mathbb{R}^N$ such that for every $v \in \mathbb{R}^N$ it is
possible to efficiently find the solution of the optimization problem

    $\min_{u \in \Omega} \|u - v\|_2$.

The Gradient Projection algorithm is a simple and generic but powerful algorithm for
solving the optimization problem

    minimize    $\mathcal{L}(\alpha')$                             (3.3.8)
    subject to  $\alpha' \in \Omega$.

It starts from an arbitrary point $\alpha^0 \in \Omega$, and iteratively takes a step of
length $\eta$ along the negative gradient. Therefore, at each iteration we have one
gradient update

    $\beta^t = \alpha^{t-1} - \eta \nabla\mathcal{L}(\alpha^{t-1})$

to reduce the value of the loss function, followed by one projection step, i.e.,
projecting the result back onto the set $\Omega$ by efficiently calculating the vector
$\alpha^t = \arg\min_{\alpha' \in \Omega} \|\alpha' - \beta^t\|_2$, in order to ensure that
$\alpha^t$ is also in the feasible set $\Omega$.

Here our loss function is the square loss
$\mathcal{L}(\alpha') = \frac{1}{2}\|f - \Phi\alpha'\|_2^2$, with
$\nabla\mathcal{L}(\alpha') = \Phi^\top(\Phi\alpha' - f)$ (the factor $\frac{1}{2}$ does not
change the minimizer of (3.3.7)). Also the feasible set $\Omega$ is the set of all
$k$-sparse vectors ($\Sigma_k$). As a result, it follows from Equation (2.1.1) that the
projection step is just a hard thresholding operation

    $\alpha^t = \arg\min_{\alpha' \in \Omega} \|\alpha' - \beta^t\|_2 = \arg\min_{\alpha' \in \Sigma_k} \|\alpha' - \beta^t\|_2 = H_k(\beta^t)$,

and can be computed efficiently in time $O(N \log N)$ by sorting the elements of $\beta^t$.

Algorithm 1  Iterative Hard Thresholding Algorithm

Inputs: $M$-dimensional vector $f$, $M \times N$ matrix $\Phi$, number of iterations $T$,
the sparsity level $k$, and update rate $\eta$.
Output: $N$-dimensional vector $\hat{\alpha}$.
Initialize $\alpha^0 = 0_N$.
for $t = 1, \ldots, T$ do
    Let $r^t = \Phi^\top(\Phi\alpha^{t-1} - f)$.        (Gradient step calculation)
    Let $\beta^t = \alpha^{t-1} - \eta r^t$.            (Gradient update)
    Set $\alpha^t = H_k(\beta^t)$.                      (Projection back to $\Sigma_k$)
end for
Output $\hat{\alpha} = \alpha^T$.
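The three steps of each IHT iteration (gradient, update, hard thresholding) can be sketched in a few lines. This is a minimal illustration using plain Python lists rather than an optimized implementation; the function names and the toy matrix are assumptions. The demo matrix is half of a $4 \times 4$ Hadamard matrix, chosen only because $\Phi^\top\Phi = I$ makes the behavior easy to verify by hand; it is not a compressive sensing matrix.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def hard_threshold(v, k):
    """H_k(v): keep the k largest-magnitude entries of v and zero out the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out


def iht(phi, f, k, eta, iterations):
    """Iterative hard thresholding, started from the zero vector."""
    phi_t = [list(col) for col in zip(*phi)]
    alpha = [0.0] * len(phi[0])
    for _ in range(iterations):
        residual = [p - y for p, y in zip(matvec(phi, alpha), f)]
        r = matvec(phi_t, residual)                      # gradient step calculation
        beta = [a - eta * g for a, g in zip(alpha, r)]   # gradient update
        alpha = hard_threshold(beta, k)                  # projection back to k-sparse vectors
    return alpha


# Toy check with phi^T phi = I.
H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
phi = [[0.5 * h for h in row] for row in H]
alpha_star = [0.0, 3.0, 0.0, -1.0]
f = matvec(phi, alpha_star)
print(iht(phi, f, k=2, eta=1.0, iterations=1))  # recovers alpha_star in one step
```

Sorting the indices by magnitude is the $O(N \log N)$ projection mentioned in the text; the two calls to `matvec` are the only operations that touch the sensing matrix.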

Algorithm 1 summarizes the Iterative Hard Thresholding (IHT) algorithm for solving
the optimization problem of Equation (3.3.7). As mentioned earlier, this algorithm
was invented independently by several researchers in different communities as a
heuristic algorithm for solving Equation (3.3.7) [85, 36, 147]. It turns out that the
Restricted Isometry Property, which is a sufficient condition for the fidelity of the
$\ell_1$-minimization algorithms, is also sufficient for proving the quick convergence of
the IHT algorithm to the optimal solution of Equation (3.3.7) [36, 119].

The analysis that we discuss here is due to Garg and Khandekar [119], and provides
a near linear-time algorithm that is guaranteed to find the solution of the program of
Equation (3.3.7) as long as the sensing matrix is $(2k, \frac{1}{3})$-RIP.

In their analysis, they first show that the loss function $\mathcal{L}(\alpha^t)$ always
decreases by a constant factor at the end of every iteration. To prove this they show that
the gradient descent step reduces the error significantly enough, while the RIP of $\Phi$
implies that the sparsification step does not increase the error by too much. The following
theorem summarizes the $\ell_2/\ell_1$ sparse approximation guarantee of the IHT algorithm.

Theorem 3.9. Let $\alpha^*$ be an arbitrary vector in $\mathbb{R}^N$, and let $e_M$ be an
arbitrary noise vector in $\mathbb{R}^M$. Define

SNR  =  7 -  ■ 

HejwIG  +  ||«*  —  Hfc(a*)||2 

Let $\Phi$ be an $M \times N$ matrix satisfying $(2k, \epsilon)$-RIP with
$\epsilon < \frac{1}{3}$. Finally let $f = \Phi\alpha^* + e_M$. Then there exists a
constant $c_{IHT} > 0$ that only depends on $\epsilon$, such that Algorithm 1 with an
appropriately chosen update rate $\eta$ computes a $k$-sparse vector $\hat{\alpha}$
satisfying

    $\|\alpha^* - \hat{\alpha}\|_2 \le c_{IHT} \left( \sqrt{1 + \epsilon} \left( \|\alpha^* - H_k(\alpha^*)\|_2 + \frac{\|\alpha^* - H_k(\alpha^*)\|_1}{\sqrt{k}} \right) + \|e_M\|_2 \right)$

in at most $O(\log \mathrm{SNR})$ iterations. Moreover, each iteration requires only $O(V)$
operations, where $V$ bounds the cost of a matrix-vector multiplication by $\Phi$ or
$\Phi^\top$.

The Iterative Hard Thresholding algorithm is a first order algorithm that solves
the $\ell_0$-constrained problem of Equation (3.3.7). By first order we mean that at each
iteration the algorithm only requires two matrix-vector multiplications,
$\Phi\alpha^{t-1}$ and $\Phi^\top(f - \Phi\alpha^{t-1})$. Therefore, the algorithm can be
implemented easily and efficiently. Nevertheless, the IHT algorithm usually performs
significantly worse than $\ell_1$ minimization (see Figure 3.2 or [185] for more
discussion).

In order to overcome the sub-optimality of the IHT algorithm compared to convex
optimization, more complicated greedy algorithms were proposed over the years. The
Compressive Sampling Matching Pursuit (CoSaMP) algorithm combines the idea of greedy
gradient projection with the idea of using convex optimization methods for sparse
approximation, with the aim of achieving a high-performance, computationally
efficient algorithm [199]. CoSaMP is an iterative algorithm that relies on two stages
of sparse approximation: a first stage selects an enlarged candidate support set in a
similar fashion to the IHT algorithm, while a second stage projects this initial
approximation down to the desired sparsity level.

Similar to the IHT algorithm, at the beginning of every iteration the gradient vector
$r^t = \Phi^\top(\Phi\alpha^{t-1} - f)$ is calculated. In IHT this gradient is used
directly to update the previous candidate $\alpha^{t-1}$ in order to obtain the new
candidate $\alpha^t$. However, in contrast to the IHT algorithm, CoSaMP is not first order.
Here the support of the significant entries of the gradient vector $r^t$ is first added to
the support of the previous candidate $\alpha^{t-1}$, with the goal of obtaining a richer
set $\Omega^t$ of columns of the sensing matrix that best represent the vector $f$. The new
candidate is then a $|\Omega^t|$-sparse vector, supported on $\Omega^t$, whose non-zero
entries are obtained by solving the optimization problem

    minimize  $\|\Phi_{\Omega^t}\beta_{\Omega^t} - f\|_2^2$,

(i.e., by projecting $f$ onto the span of the columns of $\Phi_{\Omega^t}$). Similar to
IHT, the new candidate is finally further sparsified to ensure that it belongs to the
feasible set $\Sigma_k$. The algorithm is formally detailed as Algorithm 2.
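The two stages described above (support enlargement followed by least squares and pruning) can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation of [199]: the least-squares step solves the normal equations with a small Gaussian elimination routine, which fails if the selected columns are linearly dependent, and all function names are illustrative.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def solve(G, b):
    """Solve G x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(G, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("selected columns are linearly dependent")
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= factor * aug[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def top_support(v, k):
    """Indices of the k largest-magnitude entries of v."""
    return sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]


def cosamp(phi, f, k, iterations):
    n = len(phi[0])
    cols = [list(c) for c in zip(*phi)]
    alpha = [0.0] * n
    for _ in range(iterations):
        residual = [p - y for p, y in zip(matvec(phi, alpha), f)]
        r = [sum(a * e for a, e in zip(col, residual)) for col in cols]  # phi^T residual
        # Enlarge the candidate set: 2k largest gradient entries plus the current support.
        omega = set(top_support(r, 2 * k)) | {i for i in range(n) if alpha[i] != 0.0}
        idx = sorted(omega)
        # Least squares restricted to the candidate columns (normal equations).
        gram = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in idx] for i in idx]
        rhs = [sum(a * y for a, y in zip(cols[i], f)) for i in idx]
        beta = [0.0] * n
        for i, value in zip(idx, solve(gram, rhs)):
            beta[i] = value
        # Prune back to the k largest entries.
        alpha = [0.0] * n
        for i in top_support(beta, k):
            alpha[i] = beta[i]
    return alpha
```

Unlike IHT, each iteration here solves a small linear system, which is why the text describes CoSaMP as not first order. In practice the least-squares step would usually be delegated to a numerical linear algebra library.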

The following theorem was proved by Needell and Tropp [199], and provides an
$\ell_2/\ell_1$ sparse approximation guarantee for CoSaMP.

Theorem 3.10. Suppose that $\Phi$ is an $M \times N$ sensing matrix which is
$(4k, 0.1)$-RIP. Let $f = \Phi\alpha^* + e_M$ be a vector of samples of an arbitrary
signal, contaminated with arbitrary noise. Define

    $\mathrm{SNR} = \frac{\|H_k(\alpha^*)\|_2}{\|e_M\|_2}$.                    (3.3.9)

Then the algorithm CoSaMP produces a $k$-sparse approximation $\hat{\alpha}$ that satisfies

    $\|\hat{\alpha} - \alpha^*\|_2 \le 20 \left( \|\alpha^* - H_k(\alpha^*)\|_2 + \frac{\|\alpha^* - H_k(\alpha^*)\|_1}{\sqrt{k}} + \|e_M\|_2 \right)$

in at most $O(\log \mathrm{SNR})$ iterations. Moreover, each iteration requires only $O(V)$
operations, where $V$ bounds the cost of a matrix-vector multiplication by $\Phi$ or
$\Phi^\top$.

Algorithm 2  CoSaMP Algorithm

Inputs: $M$-dimensional vector $f$, $M \times N$ matrix $\Phi$, number of iterations $T$,
and the sparsity level $k$.
Output: $N$-dimensional vector $\hat{\alpha}$.
Initialize $\alpha^0 = 0_N$.
for $t = 1, \ldots, T$ do
    Let $r^t = \Phi^\top(\Phi\alpha^{t-1} - f)$.                                   (Gradient step calculation)
    Let $\Omega = \mathrm{Supp}(H_{2k}(r^t))$.                                      (Identify large components)
    Let $\Omega^t = \Omega \cup \mathrm{Supp}(\alpha^{t-1})$.                       (Enlarging the candidate set)
    Let $\beta^t_{\Omega^t} = \Phi_{\Omega^t}^{\dagger} f$, and $\beta^t_{(\Omega^t)^c} = 0$.   (Signal estimation by least squares)
    Let $\alpha^t = H_k(\beta^t)$.                                                  (Projection back to $\Sigma_k$)
end for
Output $\hat{\alpha} = \alpha^T$.


Examples of other greedy algorithms include the classical Matching Pursuit (MP)
[186], Orthogonal Matching Pursuit (OMP) [241, 126], stagewise OMP (StOMP) [96],
regularized OMP (ROMP) [200], subspace pursuit [82], Iterative Soft Thresholding
(IST) [107], and SAMP [89]. A comparison of a few key greedy algorithms for
RIP-based compressed sensing is provided in Table 3.2.

Greedy algorithms are favorable for compressed sensing due to their computational
efficiency and their simplicity of implementation. However, a major problem
with most greedy algorithms is that the sparsity level $k$ must be known to the user a
priori. To address this difficulty, Donoho and Maleki have suggested using tuned greedy
algorithms [185]. A tuned greedy recovery algorithm is a recovery algorithm that uses
a hard-coded sparsity level $k$, which is determined as a function of the data dimension
$N$ and the number of measurements $M$. The user does not need to know this hard-coded
number. If the actual sparsity of $\alpha^*$ is better than the assumed value, the
algorithm still works, but if the sparsity is actually worse, the algorithm will not work
even if tuned to assume that worse sparsity level.

The Tuned Two-Stage Thresholding (Tuned TST) algorithm [185] is a generalization of
the CoSaMP algorithm that does not require the sparsity level to be specified by the
user. It has been reported that the Tuned TST algorithm empirically outperforms the
original CoSaMP algorithm [185]. Therefore, in the rest of this thesis, unless specified
explicitly, we use the Tuned TST algorithm as the baseline greedy algorithm.

Table 3.2: A comparison of a few selected greedy algorithms for RIP-based compressed
sensing and the Basis Pursuit Denoising algorithm. The algorithms are robust against
noise. Here $M$ and $N$ denote the number of rows and columns of the matrix $\Phi$. $V$
denotes the time taken in performing the two matrix operations $\Phi v$ and $\Phi^\top u$.
Also SNR is defined by Equation (3.3.9), and all bounds ignore the $O(\cdot)$ constants.

Algorithm              Approach             Recovery Time    Recovery Condition                                           Recovery Guarantee
---------------------  -------------------  ---------------  -----------------------------------------------------------  ------------------
BPDN [55]              Convex Optimization  $N^{1.5}M^2$     $\Phi$: $(2k, \epsilon)$-RIP with $\epsilon < \sqrt{2} - 1$   $\ell_2/\ell_1$
IHT [119]              Greedy               $V \log$ SNR     $\Phi$: $(2k, \epsilon)$-RIP with $\epsilon < \frac{1}{3}$    $\ell_2/\ell_1$
IHT [36]               Greedy               $V \log$ SNR     $\Phi$: $(3k, \epsilon)$-RIP with $\epsilon <$ [illegible]    $\ell_2/\ell_1$
Subspace Pursuit [82]  Greedy               $V \log$ SNR     $\Phi$: $(3k, \epsilon)$-RIP with $\epsilon < 0.06$           $\ell_2/\ell_1$
SAMP [89]              Greedy               $V \log$ SNR     $\Phi$: $(3k, \epsilon)$-RIP with $\epsilon < 0.06$           $\ell_2/\ell_1$
CoSaMP [199]           Greedy               $V \log$ SNR     $\Phi$: $(4k, \epsilon)$-RIP with $\epsilon < 0.1$            $\ell_2/\ell_1$

Theorems 3.9 and 3.10 provide theoretical $\ell_2/\ell_1$ sparse approximation bounds on
the performance of the IHT and CoSaMP algorithms, which are similar to the $\ell_2/\ell_1$
guarantee of $\ell_1$-minimization methods (Theorem 3.7). However, a good asymptotic
theoretical bound is not very useful if the constants involved are very large. From
a practical perspective, it is very important to quantify the actual reconstruction
accuracy of the proposed greedy algorithms, and in particular to determine how well
each greedy algorithm performs compared to the $\ell_1$-minimization approach.

To see how each greedy algorithm compares to $\ell_1$-minimization empirically, the
following Monte Carlo simulation was suggested by Donoho and Tanner [98] (see also
[185]). Fix the signal dimension $N = 800$, and sweep across $k$ and $M$ values. For each
$(k, M)$ pair, repeat the following 100 times: (i) generate a $k$-sparse vector $\alpha^*$
with random support, random sign, and unit norm, (ii) generate compressive measurements
(no noise) using a RIP sampling matrix, and (iii) recover a $k$-sparse approximation
$\hat{\alpha}$ for $\alpha^*$ using each greedy algorithm. Finally report the number of
recoveries that obtain reconstruction error $\|\alpha^* - \hat{\alpha}\|_2$ less than
$10^{-2}$.
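One cell of this experiment, for a fixed $(k, M)$ pair, can be sketched as below. Several choices here are assumptions not fixed by the text: the non-zero entries all have magnitude $1/\sqrt{k}$, the sensing matrix has i.i.d. Gaussian entries with variance $1/M$, and `recover` is a hypothetical callable standing in for whichever recovery algorithm is being evaluated.

```python
import math
import random


def random_sparse_unit_vector(n, k, rng):
    """k-sparse vector with random support and random signs, with unit l2 norm."""
    alpha = [0.0] * n
    for i in rng.sample(range(n), k):
        alpha[i] = rng.choice((-1.0, 1.0)) / math.sqrt(k)
    return alpha


def gaussian_matrix(m, n, rng):
    """M x N matrix with i.i.d. N(0, 1/M) entries (an assumed choice of sensing matrix)."""
    sigma = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]


def count_successes(recover, n, m, k, trials=100, tol=1e-2, seed=0):
    """Number of trials in which recover(phi, f, k) is within tol of the true vector."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        alpha = random_sparse_unit_vector(n, k, rng)
        phi = gaussian_matrix(m, n, rng)
        f = [sum(p * a for p, a in zip(row, alpha)) for row in phi]  # noiseless measurements
        estimate = recover(phi, f, k)
        err = math.sqrt(sum((a - b) ** 2 for a, b in zip(alpha, estimate)))
        successes += err < tol
    return successes
```

Sweeping `k` and `m` over a grid and plotting the success counts gives an empirical phase transition of the kind discussed next; with plain Python lists the sizes need to be kept well below $N = 800$ for the loop to finish quickly.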
[Figure 3.2 plot, titled "Comparison of Different Algorithms"]

Figure 3.2: Phase transitions of several baseline sparse approximation algorithms as
provided in [185]. The upper curve indicates the theoretical phase transition of
$\ell_1$-minimization (which is characterized by Donoho and Tanner [98]), and the lower
curves show the empirical phase transitions of different algorithms.


Figure 3.2 compares the empirical performance of different sparse reconstruction
algorithms. The curve corresponding to each algorithm shows the phase transition
of that algorithm. Below the phase transition curve, the algorithm works well, and
above that curve it fails; the transition zone is narrow, and becomes better defined at
large problem sizes $N$ (see [185] for further discussion).

As shown in Figure 3.2, the $\ell_1$-minimization methods (e.g. LARS) always outperform
the greedy algorithms in terms of the maximum sparsity level $k$ that can be recovered
using the algorithm. On the other hand, the greedy algorithms are typically
significantly faster. In other words, there is always a trade-off between the performance
and efficiency of the two approaches. If the sparsity value ($k$) is not too large, it
is more beneficial to use fast greedy algorithms. On the other hand, if the sparsity
value is higher than a threshold, then it is preferable to use the $\ell_1$-minimization
methods to increase the chance of successful recovery.

3.3.3  Bayesian Compressive Sensing


Bayesian compressed sensing provides a third approach for solving the robust
compressed sensing problem [160, 4, 194]. In Bayesian compressed sensing it is often
assumed a priori that the unknown sparse vector $\alpha^*$ is sampled from a distribution
that favors sparse vectors, and the noise vector $e_M$ is sampled from some stochastic
distribution (e.g. the multivariate Gaussian distribution). The main goal is then to
estimate the hidden parameters of the underlying distributions through a maximum
a posteriori (MAP) optimization in order to completely identify the posterior density
function of $\alpha^*$ [34].

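More precisely, given the measurement vector $f \in \mathbb{R}^M$, the goal is to find an
estimate $\hat{\alpha}$ that maximizes the posterior probability $\Pr[\alpha' \mid f]$
(equivalently, its logarithm). It follows from Bayes' rule that

    $\hat{\alpha} = \arg\max_{\alpha'} \Pr[\alpha' \mid f]$
    $\phantom{\hat{\alpha}} = \arg\max_{\alpha'} \frac{\Pr[f \mid \alpha'] \Pr[\alpha']}{\Pr[f]}$                  (3.3.10)
    $\phantom{\hat{\alpha}} = \arg\max_{\alpha'} \left( \Pr[f \mid \alpha'] \Pr[\alpha'] \right)$,

where the last equality follows from the fact that $f$ is already observed, and therefore
$\Pr[f]$ is a constant independent of $\alpha'$.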

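The conditional distribution $\Pr[f \mid \alpha']$ models the noise process. The simplest
noise model assumes that the measurement noise is white Gaussian with mean $0_M$ and
covariance $\sigma_M^2 I_{M \times M}$; therefore we have

    $\Pr[f \mid \alpha'] = (2\pi\sigma_M^2)^{-M/2} \exp\left( -\frac{\|f - \Phi\alpha'\|_2^2}{2\sigma_M^2} \right)$.        (3.3.11)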

The prior distribution $\Pr[\alpha']$ models the prior knowledge about the vector
$\alpha'$. In Bayesian compressed sensing, prior distributions that give more weight to
sparse vectors are of more interest. A widely used sparseness prior is the Laplace density
function with zero mean and variance $\sigma_\alpha^2 I_{N \times N}$ [32].


By choosing the Laplace distribution as the prior, and the white Gaussian distribution
for modeling the noise, we have

    $\hat{\alpha} = \arg\max_{\alpha'} \left( \Pr[f \mid \alpha'] \Pr[\alpha'] \right)$                        (3.3.12)
    $\phantom{\hat{\alpha}} = \arg\max_{\alpha'} \log\left( \Pr[f \mid \alpha'] \Pr[\alpha'] \right)$
    $\phantom{\hat{\alpha}} = \arg\min_{\alpha'} \left( \|f - \Phi\alpha'\|_2^2 + \lambda\|\alpha'\|_1 \right)$.

The optimization problem of Equation (3.3.12) is a LASSO optimization problem
(Equation 3.3.3), with a regularization parameter $\lambda$ determined by the noise
variance and the variance of the prior. This provides another angle on why
$\ell_1$-minimization methods are well suited to sparse approximation. As Equation (3.3.12)
indicates, the solution of the LASSO optimization with a suitable regularization
parameter $\lambda$ is indeed the solution of a MAP optimization problem with Laplace prior
and white Gaussian noise.
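The step from the MAP objective to the LASSO objective can be written out under an explicit parametrization. The derivation below is a sketch that assumes an i.i.d. Laplace prior with scale parameter $b$, that is, $\Pr[\alpha'] \propto \exp(-\|\alpha'\|_1 / b)$; the symbol $b$ is introduced here for illustration only and may differ from the parametrization used in the thesis.

```latex
\begin{align*}
-\log\bigl(\Pr[f \mid \alpha']\,\Pr[\alpha']\bigr)
  &= \frac{1}{2\sigma_M^2}\,\|f - \Phi\alpha'\|_2^2
     + \frac{1}{b}\,\|\alpha'\|_1 + \text{const}, \\
\hat{\alpha}
  &= \arg\min_{\alpha'} \Bigl( \|f - \Phi\alpha'\|_2^2
     + \frac{2\sigma_M^2}{b}\,\|\alpha'\|_1 \Bigr).
\end{align*}
```

Under this assumption the LASSO parameter is $\lambda = 2\sigma_M^2 / b$: more measurement noise or a more concentrated prior both lead to stronger $\ell_1$ regularization.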

Since solving the MAP optimization problem $\max_{\alpha'} \log \Pr[\alpha' \mid f]$ with a
Laplace prior is equivalent to solving a LASSO optimization, it also inherits the
$O(M^2 N^{1.5})$ computational complexity of solving the LASSO optimization [205]. To
overcome this computational difficulty, Donoho, Maleki, and Montanari proposed the
Approximate Message-Passing (AMP) algorithm for approximately solving the LASSO problem
[91, 94]. Empirical observations suggest that the solution of the AMP algorithm quickly
converges to the optimal solution of LASSO [92, 93]. However, the convergence of
AMP has only been proved for sensing matrices obtained from Gaussian distributions
[91, 25], which suffer from computation and storage limitations (see Section 3.4.1 for
further discussion). Proving the convergence of the AMP algorithm for efficient
sensing matrices is an interesting and important open problem.

Another approach to overcome this computational difficulty is to use other
sparsity-promoting prior distributions, so that the MAP optimization problem can be solved
efficiently using standard Bayesian optimization methods, including the Markov
Chain Monte Carlo method [127] and the Variational Inference method [248].

Ji, Xue, and Carin [160] have addressed this issue by introducing the Bayesian
Compressive Sensing (BCS) algorithm, which uses the relevance vector machine (RVM) for
sparse approximation [234]. Rather than imposing a Laplace prior on $\alpha'$, the RVM
invokes a hierarchical prior [121]. The hierarchical prior has similar
sparsity-promoting properties to the Laplace prior but allows convenient conjugate-prior
properties, which are useful for implementing a Markov Chain Monte Carlo
(MCMC) or a variational Bayesian optimization algorithm [198].

Similarly, Carmi et al. [62] proposed the Approximate Bayesian Compressive Sensing
(ABCS) algorithm, which uses the semi-Gaussian prior distribution. A distinguishing
feature of the semi-Gaussian distribution is greater concentration in the vicinity of
the origin, which promotes sparsity more aggressively than $\ell_1$-minimization.

Even though simulation results indicate that the BCS and ABCS algorithms have
good performance and in some cases even outperform the $\ell_1$-minimization method,
they both have computational complexity $O(NM^2)$, which is still inefficient for many
compressed sensing applications with $N \approx 10^9$ and $M \approx 10^6$.

One approach which can further reduce the computational complexity of the sparse
recovery phase in Bayesian compressed sensing is to use efficient Belief Propagation
algorithms [182]. Belief Propagation is a fast message-passing algorithm that has
been extensively used for efficiently decoding Low-Density Parity Check (LDPC)
codes [227]. The fundamental connection between compressed sensing and the theory
of error-correcting codes suggests the idea of adopting Belief Propagation to solve the
compressed sensing problem.

Table 3.3: Summary of the main algorithms in the Bayesian compressed sensing
framework.

Algorithm     Signal Prior               Sensing Matrix   Recovery Time       Convergence Guarantee
------------  -------------------------  ---------------  ------------------  ---------------------
LASSO [233]   Laplace                    RIP              $O(N^{1.5}M^2)$     Yes
BCS [160]     Hierarchical prior         RIP              $O(NM^2)$           Yes
ABCS [62]     semi-Gaussian              RIP              $O(NM^2)$           Yes
CS-BP [23]    Mixture of two Gaussians   LDPC             $O(N \log^2 N)$     No
SuPrEM [4]    Gaussian-scale mixtures    LDPC             $O(N)$              No

Recently, there have been several papers on using Belief Propagation algorithms for sparse recovery. In [222, 23], the authors introduced the belief propagation approach to compressive sensing, and applied it to the recovery of random signals, modeled by a two-state mixture of Gaussians (with more weight on the narrower Gaussian to promote sparsity). Their proposed Compressive Sensing Belief Propagation (CS-BP) algorithm has $O(N \log^2 N)$ computational complexity and is significantly faster than BCS and ABCS. In a more recent paper, Akcakaya, Park, and Tarokh [4] used belief propagation on signals modeled as Gaussian-scale mixtures. Their proposed Sum Product with Expectation Maximization (SuPrEM) algorithm has $O(N)$ running time, and is shown to have excellent empirical performance.

Nevertheless, the major problem with the belief propagation approach is that neither CS-BP nor SuPrEM is guaranteed to converge. In contrast to the other sparse reconstruction algorithms, much less is known about the theoretical performance of the CS-BP and SuPrEM algorithms. Analyzing the convergence rates of the Belief Propagation algorithms is an interesting and important open problem in Bayesian compressed sensing. Table 3.3 summarizes a comparison of different Bayesian compressive sensing algorithms.
3.4  Construction  of  RIP  Sensing  Matrices

3.4.1  Random  RIP  Constructions 

Thus far we have seen that if a matrix is $(2k, \epsilon)$-RIP for sufficiently small $\epsilon$, then the $\ell_1$-minimization, greedy, and Bayesian algorithms can stably approximate any sparse vector $\alpha^*$ from the low-dimensional vector $f$. Therefore, the problem of stable compressed sensing is now reduced to the problem of finding RIP sensing matrices. Here we provide examples of sensing matrices satisfying the RIP.

Definition 3.11 (Gaussian sensing matrix). A Gaussian sensing matrix is an $M \times N$ matrix whose entries are sampled independently and identically from a $\mathcal{N}(0, \frac{1}{M})$ distribution.

Definition 3.12 (Rademacher sensing matrix). A Rademacher sensing matrix is an $M \times N$ matrix such that each entry is independently assigned to be $\pm\frac{1}{\sqrt{M}}$, each with probability $\frac{1}{2}$.

The following theorem by Baraniuk et al. [21] shows that Gaussian and Rademacher processes generate $M \times N$ matrices that satisfy the RIP with high probability:

Theorem 3.13. Suppose that $M$, $N$, and $0 < \epsilon < 1$ are given. Let $\Phi$ be an $M \times N$ Rademacher (or Gaussian) sensing matrix. Then there exist absolute constants $c_1, c_2 > 0$, such that $\Phi$ is $(k, \epsilon)$-RIP for any $k \le \frac{c_1 \epsilon^2 M}{\log(N/k)}$ with probability at least $1 - 2\exp\{-c_2 \epsilon^2 M\}$.

Theorem 3.13 is of particular interest as it concludes that stable compressed sensing is possible using random Gaussian or Rademacher sensing matrices combined with $\ell_1$-minimization. Moreover, only $O(k \log(N/k))$ measurements are required in order to be able to successfully recover any $k$-sparse vector. As we will see in Section 3.5, at least $\Omega(k \log(N/k))$ measurements are always required to have stable compressed sensing, and therefore Gaussian and Rademacher matrices are optimal with respect to the number of required measurements.
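To make Definitions 3.11 and 3.12 concrete, the following is a minimal sketch in plain Python (an illustration, not part of the original development). The function names and the sizes M, N, k are assumptions chosen for the example. It draws both kinds of matrices and evaluates the ratio $\|\Phi\alpha\|_2^2 / \|\alpha\|_2^2$ on one random $k$-sparse vector. Checking a single vector only illustrates the near-isometry behaviour; it does not verify the RIP, which concerns all $k$-sparse vectors simultaneously.

```python
import math
import random

def gaussian_sensing_matrix(M, N, rng):
    """M x N matrix with i.i.d. N(0, 1/M) entries (Definition 3.11)."""
    sigma = 1.0 / math.sqrt(M)
    return [[rng.gauss(0.0, sigma) for _ in range(N)] for _ in range(M)]

def rademacher_sensing_matrix(M, N, rng):
    """M x N matrix with i.i.d. entries +-1/sqrt(M), each with probability 1/2."""
    s = 1.0 / math.sqrt(M)
    return [[s if rng.random() < 0.5 else -s for _ in range(N)] for _ in range(M)]

def matvec(Phi, x):
    """Dense matrix-vector product; costs M*N multiplications."""
    return [sum(p * v for p, v in zip(row, x)) for row in Phi]

def random_sparse_vector(N, k, rng):
    """Vector with k nonzero Gaussian entries at random positions."""
    x = [0.0] * N
    for i in rng.sample(range(N), k):
        x[i] = rng.gauss(0.0, 1.0)
    return x

def isometry_ratio(Phi, x):
    """||Phi x||_2^2 / ||x||_2^2; a (k, eps)-RIP matrix keeps this in [1-eps, 1+eps]."""
    y = matvec(Phi, x)
    return sum(v * v for v in y) / sum(v * v for v in x)

# Illustrative sizes (assumptions, not values from the text).
rng = random.Random(0)
M, N, k = 200, 400, 5
x = random_sparse_vector(N, k, rng)
ratio_gauss = isometry_ratio(gaussian_sensing_matrix(M, N, rng), x)
ratio_rad = isometry_ratio(rademacher_sensing_matrix(M, N, rng), x)
print(ratio_gauss, ratio_rad)
```

Both ratios should come out close to 1 for these sizes. The dense storage and the M*N cost of `matvec` are exactly the limitations discussed in the text.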

However, there is no efficient algorithm to verify whether a given random Gaussian or Rademacher matrix satisfies the RIP or not. In addition, since Gaussian and Rademacher matrices do not have any structure, memory is an issue: $\Omega(MN)$ bits are required to store the whole matrix. For the same reason, any matrix-vector multiplication requires $\Omega(MN)$ operations, which makes the encoding less efficient.

To overcome the difficulties of using Gaussian or Rademacher sensing matrices, alternative RIP matrices have been introduced.

Definition 3.14 (Subsampled unitary matrices). Let $U$ be any $N \times N$ unitary matrix. Choose a subset $\Omega$ of cardinality $|\Omega| = M$ uniformly at random from the set $\{1, \dots, N\}$. Let $\Phi$ be the $M \times N$ matrix obtained by sampling the $M$ rows of $U$ corresponding to the indices in $\Omega$ and renormalizing the resulting columns so that they have unit $\ell_2$-norms. Then $\Phi$ is an $M \times N$ subsampled unitary matrix.


The  following  theorem  indicates  that  as  long  as  M  is  sufficiently  large,  any  subsampled 
unitary  matrix  is  RIP  with  overwhelming  probability. 

Theorem 3.15 ([218]). For all integers $N$ and $k$, and for any $t > 1$ and any $\epsilon \in (0, 1)$, let

$$M \ge c_3 N \|U\|_{\max}^2 \, k t \log^4 N.$$

Then the subsampled matrix $\Phi$ is $(k, \epsilon)$-RIP with probability exceeding $1 - 10\exp\{-c_4 \epsilon^2 t\}$, where $c_3$ and $c_4$ are absolute constants that do not depend on $M$, $N$, or $k$.

The following corollary follows from Theorem 3.15 by taking $t = O\left(\frac{\log N}{\epsilon^2}\right)$:

Corollary 3.16. Let $U$ be any $N \times N$ unitary matrix whose entries have magnitude $O\left(\frac{1}{\sqrt{N}}\right)$. For each integer $k$ and any $\epsilon \in (0, 1)$, let

$$M \ge c_3' \frac{k \log^5 N}{\epsilon^2},$$

where $c_3'$ is an absolute constant. Then the subsampled matrix $\Phi$ is $(k, \epsilon)$-RIP with probability exceeding $1 - \frac{1}{N}$.


Partial  Hadamard  matrices  are  one  example  of  subsampled  unitary  matrices  satisfying 
the  conditions  of  Corollary  3.16. 

Definition 3.17 (Hadamard Matrix). The Hadamard transform $H_n$ is a $2^n \times 2^n$ matrix that can be defined recursively in the following way: We define the $1 \times 1$ Hadamard matrix $H_0$ by the identity $H_0 = 1$, and then define $H_n$ for $n > 0$ by:

$$H_n = \frac{1}{\sqrt{2}} \begin{pmatrix} H_{n-1} & H_{n-1} \\ H_{n-1} & -H_{n-1} \end{pmatrix}. \qquad (3.4.1)$$


Without loss of generality suppose $N = 2^n$. The Hadamard matrix $H_n$ can be viewed as a symmetric unitary matrix with $\|H_n\|_{\max} = \frac{1}{\sqrt{N}}$. Therefore, it follows from Corollary 3.16 that as long as $M \ge c_3' \frac{k \log^5 N}{\epsilon^2}$, an $M \times N$ subsampled Hadamard matrix is $(k, \epsilon)$-RIP.

Note that in contrast to random Gaussian and Rademacher matrices, only $O(M \log N)$ random bits are required to store a partial Hadamard matrix. Moreover, the matrix-vector multiplication $H_n v$ can be performed efficiently in time $O(N \log N)$ using the fast Walsh-Hadamard transform [112].
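The $O(N \log N)$ multiplication can be sketched as follows. This is a minimal pure-Python illustration of the standard butterfly form of the fast Walsh-Hadamard transform, normalized to match the recursion (3.4.1). The helper `partial_hadamard_measure` and the sizes used are assumptions made for the example: it keeps the rows indexed by a random set and rescales by $\sqrt{N/M}$ so that the columns of the subsampled matrix have unit $\ell_2$-norm.

```python
import math
import random

def fwht(v):
    """Normalized fast Walsh-Hadamard transform H_n v for len(v) = 2**n.

    Uses O(N log N) additions via in-place butterflies that mirror the
    block recursion (3.4.1); the overall 1/sqrt(N) factor is applied at the end.
    """
    a = list(v)
    N = len(a)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < N:
        for start in range(0, N, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    scale = 1.0 / math.sqrt(N)
    return [scale * x for x in a]

def partial_hadamard_measure(v, rows):
    """Apply the M x N subsampled Hadamard matrix to v.

    Keeps the transform coefficients indexed by `rows` and rescales by
    sqrt(N/M) so that every column of the subsampled matrix has unit l2-norm.
    """
    N, M = len(v), len(rows)
    full = fwht(v)
    scale = math.sqrt(N / M)
    return [scale * full[r] for r in rows]

# Illustrative sizes (assumptions): N = 16 coefficients, M = 6 measurements.
rng = random.Random(1)
N, M = 16, 6
rows = sorted(rng.sample(range(N), M))  # the random index set Omega
v = [rng.gauss(0.0, 1.0) for _ in range(N)]
f = partial_hadamard_measure(v, rows)
```

Only the index set `rows` has to be stored, which is where the $O(M \log N)$ storage figure comes from: $M$ indices of $\log N$ bits each.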

Therefore, partial Hadamard matrices are superior to random Gaussian or Rademacher matrices in terms of the required storage and computational time of calculating matrix-vector multiplications. The main drawback of partial Hadamard matrices is sub-optimality in the number of required measurements for satisfying the RIP. With Gaussian and Rademacher matrices only $O(k \log(N/k))$ measurements are required to guarantee the RIP with overwhelming probability, whereas $O(k \log^5 N)$ measurements are required for partial Hadamard matrices to have the same RIP with the same probability.


3.4.2  Deterministic  RIP  Constructions 


Thus  far,  we  have  seen  Gaussian,  Rademacher,  and  the  partial  Hadamard  matrices 
as  examples  of  random  sensing  matrices  satisfying  the  RIP.  As  mentioned  earlier, 
Gaussian  and  Rademacher  matrices  suffer  from  storage  and  computational  issues,  and 
partial  Hadamard  matrices  suffer  from  sub-optimality  in  the  number  of  measurements. 
In  addition,  there  is  no  efficient  algorithm  to  verify  whether  a  random  matrix  is  RIP 
or  not.  Therefore,  it  is  desirable  to  construct  deterministic  matrices  satisfying  the 
RIP. 

Most explicit constructions of RIP matrices are based on bounding the mutual coherence between the columns of the sensing matrix.

Definition 3.18 (Mutual coherence). Let $\Phi$ be an $M \times N$ sensing matrix with normalized columns $\varphi_1, \dots, \varphi_N$. The mutual coherence between the columns of $\Phi$ is then defined as

$$\mu = \max_{i \ne j} |\langle \varphi_i, \varphi_j \rangle|.$$


The following lemma connects the RIP property of any sensing matrix $\Phi$ with normalized columns to the mutual coherence of $\Phi$.

Lemma 3.19. Let $\Phi$ be an $M \times N$ sensing matrix with normalized columns and with mutual coherence $\mu$. Then $\Phi$ is $(k, \epsilon)$-RIP with $\epsilon = (k - 1)\mu$.


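Proof. Let $\alpha$ be any $k$-sparse vector. We have

$$\left| \|\Phi\alpha\|_2^2 - \|\alpha\|_2^2 \right| = \left| \sum_{i} \sum_{j \ne i} \alpha_i \alpha_j \langle \varphi_i, \varphi_j \rangle \right| \le \mu \sum_{i} \sum_{j \ne i} |\alpha_i| |\alpha_j| = \mu \left( \left( \sum_{i=1}^{N} |\alpha_i| \right)^2 - \|\alpha\|_2^2 \right) \le \mu (k - 1) \|\alpha\|_2^2. \qquad (3.4.2)$$

The last inequality follows from the Cauchy-Schwarz inequality and the fact that $\alpha$ is $k$-sparse:

$$\left( \sum_{i=1}^{N} |\alpha_i| \right)^2 = \|\alpha\|_1^2 \le k \|\alpha\|_2^2.$$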
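As a small worked instance of Definition 3.18 and Lemma 3.19, the sketch below computes the mutual coherence of a toy matrix and the RIP constant $(k-1)\mu$ it implies. The test matrix (a $4 \times 4$ identity concatenated with a normalized $4 \times 4$ Hadamard matrix) and all function names are assumptions chosen for illustration; it is not one of the constructions cited in this section.

```python
import math

def normalize_columns(Phi):
    """Rescale every column of Phi to unit l2-norm."""
    M, N = len(Phi), len(Phi[0])
    norms = [math.sqrt(sum(Phi[r][c] ** 2 for r in range(M))) for c in range(N)]
    return [[Phi[r][c] / norms[c] for c in range(N)] for r in range(M)]

def mutual_coherence(Phi):
    """max over i != j of |<phi_i, phi_j>| for a matrix with normalized columns."""
    M, N = len(Phi), len(Phi[0])
    cols = [[Phi[r][c] for r in range(M)] for c in range(N)]
    mu = 0.0
    for i in range(N):
        for j in range(i + 1, N):
            inner = sum(a * b for a, b in zip(cols[i], cols[j]))
            mu = max(mu, abs(inner))
    return mu

def rip_constant_from_coherence(mu, k):
    """RIP constant guaranteed by Lemma 3.19: eps = (k - 1) * mu."""
    return (k - 1) * mu

# Toy example: [ I_4 | H_2 ], where H_2 has entries +-1/2.
I4 = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
H4 = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
Phi = normalize_columns([I4[r] + [0.5 * h for h in H4[r]] for r in range(4)])

mu = mutual_coherence(Phi)                   # every identity/Hadamard pair gives 1/2
eps = rip_constant_from_coherence(mu, k=2)   # (2 - 1) * 1/2
print(mu, eps)
```

For this matrix the bound is informative only for $k = 2$; for $k \ge 3$ the guaranteed constant $(k-1)\mu$ is already at least 1, which illustrates how quickly coherence-based estimates degrade as $k$ grows.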
Examples of sensing matrices with small mutual coherence have been constructed by Calderbank et al. [140], Applebaum et al. [11], Bajwa et al. [15], Kashin [166], Alon et al. [7], DeVore [88], Iwen [154], and Nelson and Temlyakov [201]. All these constructions have mutual coherence $\mu \le \frac{\log N}{\sqrt{M} \log M}$, and therefore satisfy RIP for $k = O\left(\frac{\sqrt{M} \log M}{\log N}\right)$.


On the other hand, the Welch bound [251] demonstrates that the mutual coherence of a matrix with normalized columns cannot be too small. More precisely, there is a universal lower bound

$$\mu \ge \sqrt{\frac{\log N}{M \log(M / \log N)}} \ge \frac{1}{\sqrt{M}}$$

as long as $M \le N/2$. Therefore, by estimating RIP parameters in terms of the coherence parameter we cannot construct $M \times N$ $(k, \epsilon)$-RIP matrices with $k > \sqrt{M}$ and $\epsilon < 1$.

Most explicit constructions of RIP matrices are based on the mutual coherence and suffer from the $k = O(\sqrt{M})$ barrier, but a recent result by Bourgain et al. [39] uses the methods of additive combinatorics to do slightly better.


Theorem 3.20 ([39]). There is an effective constant $\epsilon_0 > 0$ and an explicit number $M_0$ such that for any positive integers $M \ge M_0$ and $M \le N \le M^{1+\epsilon_0}$, there is an explicit $M \times N$ matrix which is $(k, \epsilon)$-RIP, with $k = M^{0.5+\epsilon_0}$ and $\epsilon = M^{-\epsilon_0}$.


Table 3.4 compares various properties of different RIP matrices. The result of Bourgain et al. breaks the bottleneck $M = \Omega(k^2)$ of the low-coherence matrices. However, it is still significantly sub-optimal compared to the $M = O(k \log(N/k))$ measurements of random sensing matrices. The problem of finding deterministic RIP matrices with a close to optimal ($M \approx k \log\frac{N}{k}$) number of measurements is an important open problem in the theory of compressed sensing.

A negative result by Chandar proves that if a sensing matrix has only 0, 1 entries, or if it is too sparse, then that matrix cannot satisfy the RIP [70]. This negative result and several unsuccessful attempts at designing optimal deterministic or structured RIP matrices suggest that the RIP condition may be too restrictive for compressed sensing. This is indeed the main subject of this thesis, in which we show that it is possible to design deterministic matrices that do not satisfy the RIP, but still provide almost every feature obtainable from random RIP matrices, as well as extra advantages which are not obtainable (or are in some cases even impossible) via the random RIP matrices. Before discussing these matrices, we first investigate the information-theoretic limitations of compressed sensing.
Table 3.4: Properties of different sensing matrices satisfying the $(k, \epsilon)$ Restricted Isometry Property. All bounds ignore the $O(\cdot)$ constants.

Matrix                               Number of measurements        Memory (random bits)   Matrix-vector multiplication   Random vs. Deterministic
Gaussian (Rademacher) [21]           $k \log(N/k)$                 $MN$                   $MN$                           Random
Partial Hadamard (Fourier) [218]     $k \log^5 N$                  $M \log N$             $N \log N$                     Random
Incoherent Tight-Frames [11, 15]     $k^2$                         -                      $N \log N$                     Deterministic
Randomness Extractors [39]           $k^{2/(1+2\epsilon_0)}$       -                      $NM$                           Deterministic

3.5  Compressed  Sensing  Lower  Bounds 

In Section 3.3, we saw that if a sensing matrix is RIP, then it is possible to obtain $\ell_2/\ell_1$ sparse approximation guarantees using $\ell_1$-minimization and greedy algorithms. We have also seen examples of RIP matrices with $M = O(k \log(N/k))$ measurements. Now, a natural question that comes to mind is "What kinds of improvements are possible over this existing RIP-based approach?" To answer this question, we focus on the following specific questions.

•  (Q1): Are $M = \Omega(k \log(N/k))$ measurements necessary for stable compressed sensing?

•  (Q2): Is $\ell_2/\ell_1$ the tightest sparse approximation guarantee in stable compressed sensing? Is it possible to derive other $\ell_p/\ell_q$ bounds? How do they compare to the $\ell_2/\ell_1$ bound?

•  (Q3): Is RIP necessary for stable compressed sensing? Is it possible to find RIP-less sensing matrices with similar (or even better) performances?

Here we answer the first two questions. Answering the third question is the subject of the rest of this thesis. To answer the first two questions, we first define the best $k$-term approximation, which is a fundamental problem in approximation theory [75] and is closely related to the sparse approximation problem (Definition 3.3) for stable compressed sensing.

Definition 3.21 (Best $k$-term Approximation). Let $p$ and $q$ be positive integers. Let $\Phi$ be an $M \times N$ sensing matrix, and let $\mathcal{A}_\Phi$ be a reconstruction algorithm associated with $\Phi$. Then $\mathcal{A}_\Phi$ provides an $\ell_p/\ell_q$ best $k$-term approximation guarantee if and only if there exists an absolute constant $C$, such that for every vector $\alpha^* \in \mathbb{R}^N$, given $f = \Phi\alpha^*$, $\mathcal{A}_\Phi$ can successfully recover a $k$-sparse vector $\hat{\alpha}$ with

$$\|\alpha^* - \hat{\alpha}\|_p \le \frac{C}{k^{\frac{1}{q} - \frac{1}{p}}} \|\alpha^* - H_k(\alpha^*)\|_q.$$

The $\ell_p/\ell_q$ best $k$-term approximation problem is a special case of the $\ell_p/\ell_q$ sparse approximation problem defined in Definition 3.3. However, in contrast to the general sparse approximation problem, there is no measurement noise in the best $k$-term approximation setting. The best $k$-term approximation analyses provide powerful tools for proving lower bounds on the number of required measurements for stable compressed sensing. These analyses originated from work in functional analysis and approximation theory by Kashin [166], and were later improved and generalized by Gluskin [128, 129], Cohen et al. [75], and Ba et al. [13].

The following theorem is due to Cohen et al. [75] and implies that the $\ell_1/\ell_1$ best $k$-term approximation is achievable from any algorithm that provides $\ell_2/\ell_1$ guarantees.

Theorem 3.22 ([75]). Let $\Phi$ be an $M \times N$ sensing matrix and let $\mathcal{A}_\Phi$ be a reconstruction algorithm such that $(\Phi, \mathcal{A}_\Phi)$ provides an $\ell_2/\ell_1$ sparse approximation guarantee. Then $(\Phi, \mathcal{A}_\Phi)$ also provides an $\ell_1/\ell_1$ guarantee.

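Proof. Let $\alpha^*$ be an arbitrary vector in $\mathbb{R}^N$, and let $f = \Phi\alpha^*$. Also let $\hat{\alpha} = \mathcal{A}_\Phi(f)$, and $\Delta = \alpha^* - \hat{\alpha}$. The $\ell_2/\ell_1$ guarantee of $(\Phi, \mathcal{A}_\Phi)$ implies that $\hat{\alpha}$ is $k$-sparse, and that

$$\|\Delta\|_2 \le \frac{C}{\sqrt{k}} \|\alpha^* - H_k(\alpha^*)\|_1. \qquad (3.5.1)$$

Let $S = \mathrm{Supp}(H_k(\alpha^*)) \cup \mathrm{Supp}(\hat{\alpha})$. Since both $\hat{\alpha}$ and $H_k(\alpha^*)$ are $k$-sparse, we have $|S| \le 2k$.

Therefore, it follows from Hölder's inequality that

$$\|\Delta_S\|_1 \le \sqrt{2k} \|\Delta_S\|_2 \le \sqrt{2k} \|\Delta\|_2. \qquad (3.5.2)$$

Combining (3.5.1) and (3.5.2) yields

$$\|\Delta_S\|_1 \le \frac{\sqrt{2k}\, C}{\sqrt{k}} \|\alpha^* - H_k(\alpha^*)\|_1 = \sqrt{2}\, C \|\alpha^* - H_k(\alpha^*)\|_1. \qquad (3.5.3)$$

On the other hand, since $S$ includes the top $k$ coordinates of $\alpha^*$ and $\hat{\alpha}$ vanishes outside $S$,

$$\|\Delta_{S^c}\|_1 = \|\alpha^*_{S^c}\|_1 \le \|\alpha^* - H_k(\alpha^*)\|_1.$$

Therefore

$$\|\alpha^* - \hat{\alpha}\|_1 = \|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le \left(1 + \sqrt{2}\, C\right) \|\alpha^* - H_k(\alpha^*)\|_1.$$

□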
The following theorem of Ba et al. [13] provides lower bounds on the number of required measurements for obtaining $\ell_1/\ell_1$ guarantees.

Theorem 3.23 ([13]). Let $\Phi$ be an $M \times N$ sensing matrix and let $\mathcal{A}_\Phi$ be a reconstruction algorithm such that $(\Phi, \mathcal{A}_\Phi)$ provides an $\ell_1/\ell_1$ sparse approximation guarantee. Then $M = \Omega(k \log(N/k))$.

By combining Theorem 3.22 and Theorem 3.23, we obtain similar lower bounds on the number of required measurements for $\ell_2/\ell_1$ guarantees.

Corollary 3.24. Let $\Phi$ be an $M \times N$ sensing matrix and let $\mathcal{A}_\Phi$ be a reconstruction algorithm such that $(\Phi, \mathcal{A}_\Phi)$ provides an $\ell_2/\ell_1$ guarantee. Then $M = \Omega(k \log(N/k))$.

Proof. Theorem 3.22 proves that if $(\Phi, \mathcal{A}_\Phi)$ provides an $\ell_2/\ell_1$ guarantee, then it also provides an $\ell_1/\ell_1$ guarantee. Therefore, it follows immediately from Theorem 3.23 that $M = \Omega(k \log(N/k))$ measurements are necessary. □

Remark 3.25. In Section 3.2 we saw that the $2k \times N$ Vandermonde construction of Akcakaya and Tarokh [5] can efficiently recover any vector which is exactly $k$-sparse. However, since $2k = o(k \log(N/k))$, there is no hope of finding a robust sparse reconstruction algorithm with an $\ell_1/\ell_1$ or $\ell_2/\ell_1$ guarantee for this construction. This gives another explanation why the proposed algebraic decoding is not robust against noise.

Remark 3.26. Theorem 3.1 with $e_M = 0_M$ implies that if $\Phi$ is $(2k, \sqrt{2} - 1)$-RIP, then ($\Phi$, Basis Pursuit) provides an $\ell_2/\ell_1$ best $k$-term approximation guarantee. Therefore, by invoking Corollary 3.24, any $(2k, \sqrt{2} - 1)$-RIP matrix requires $M = \Omega(k \log(N/k))$ measurements. In other words, one cannot expect to find RIP matrices with a smaller number of measurements $M = o(k \log(N/k))$. On the other hand, Theorem 3.13 proves that as long as $M = O(k \log(N/k))$, a Gaussian (or Rademacher) sensing matrix is $(2k, \sqrt{2} - 1)$-RIP, by which the $\ell_2/\ell_1$ guarantees are obtainable. This shows that the lower bound of Corollary 3.24 is tight.

Remark 3.27. In Chapter 8 we will introduce examples of RIP-less sensing matrices with optimal $M = O(k \log(N/k))$ measurements that provide $\ell_1/\ell_1$ sparse approximation guarantees. Our proposed matrices are sparse and have deterministic constructions, and do not suffer from the storage and computational limitations of RIP Gaussian or Rademacher matrices.

Thus far we have explored the connections between the $\ell_2/\ell_1$ and $\ell_1/\ell_1$ guarantees, and seen that $\Theta(k \log(N/k))$ measurements are necessary and sufficient to achieve these guarantees. Next, we will see whether it is possible to extend these results to achievability of $\ell_2/\ell_2$ approximation guarantees or not.

An argument similar to the one used in Theorem 3.22 is used by Cohen et al. [75] to prove that the $\ell_2/\ell_2$ guarantee also implies $\ell_1/\ell_1$ guarantees (with the same constant $C$); however $\ell_2/\ell_2$ does not necessarily imply the $\ell_2/\ell_1$ guarantee.
Nevertheless, Cohen et al. [75] also showed that the $\ell_2/\ell_2$ approximation is impossible in general in the compressed sensing framework unless $M = \Omega(N)$.

Theorem 3.28 ([75]). Let $\Phi$ be an $M \times N$ sensing matrix and let $\mathcal{A}_\Phi$ be a reconstruction algorithm such that $(\Phi, \mathcal{A}_\Phi)$ provides an $\ell_2/\ell_2$ sparse approximation guarantee. Then $M = \Omega(N)$.

Theorem 3.28 directly implies that $\ell_1/\ell_1$ and $\ell_2/\ell_1$ guarantees cannot imply $\ell_2/\ell_2$. Otherwise, one could use a RIP matrix with $M = O(k \log(N/k)) = o(N)$ measurements and the $\ell_1$-minimization algorithm and get the $\ell_2/\ell_2$ guarantee from the provided $\ell_2/\ell_1$ guarantee of Theorem 3.7, or from the $\ell_1/\ell_1$ guarantee of Theorem 3.22.

Finally, we emphasize that the result of Theorem 3.28 is only existential. It only shows that if $\Phi$ is an $M \times N$ sensing matrix with $M = o(N)$, then for each decoding algorithm $\mathcal{A}_\Phi$, there exists one particular vector $\alpha^* \in \mathbb{R}^N$ (that may depend on the choice of $\mathcal{A}_\Phi$) such that $\|\alpha^* - \mathcal{A}_\Phi(\Phi\alpha^*)\|_2$ is significantly large. In contrast, in Chapter 12 we will provide examples of deterministic sensing matrices with $M = O(k \log N)$ and efficient reconstruction algorithms $\mathcal{A}_\Phi$, such that $(\Phi, \mathcal{A}_\Phi)$ provides $\ell_2/\ell_2$ guarantees for most (in contrast to all) vectors.
Chapter  4
Applications


4.1  Compressive  Imaging 

4.1.1  Image  Compression 

A fundamental assumption in digital image processing is that natural images are piecewise smooth in the pixel basis. That is, there are very few edges in the image, and therefore, the differences between the values of adjacent pixels are usually zero or almost zero. The wavelet transform can be used to map images from the pixel domain to the wavelet domain in which they have sparse (or approximately sparse) representations [84].

For example, Figure 4.1(a) shows the representation of a natural image in the pixel domain, and Figure 4.1(b) shows the representation of the same image in the wavelet domain. As the figure shows, there are very few significant (light) coefficients in the wavelet representation of this image, whereas most wavelet coefficients are almost zero (black).

The wavelet sparsity of images is used in the image compression application [131]. In order to compress a $\sqrt{N} \times \sqrt{N}$ image, the camera first treats the image as a high-dimensional ($N$-dimensional) vector and calculates its wavelet representation. It then stores the positions and values of the $k \ll N$ significant wavelet coefficients and throws away the remaining information. The decoding can therefore be done efficiently by forming the (sparsified) wavelet vector, and applying the inverse wavelet transform to restore the image.

Since images are approximately sparse in the wavelet basis, the sparsified image still provides a good approximation of the original image. For instance, Figure 4.2 shows the resulting images when only the largest 1%, 3% or 10% of the wavelet coefficients of a 256 x 256 image are used.

Even though this image compression protocol provides precise sparse approximation to most natural images, it is inefficient and costly. It is inefficient because we ultimately throw away most of the calculated wavelet coefficients. We calculate $N$ wavelet coefficients but then we just keep the $k$ largest coefficients and discard the rest of them. It is also costly as the camera requires $N$ sensors, whereas only $k \ll N$ coefficients are ultimately stored.

Figure 4.1: Example of sparse approximations in a wavelet basis. (a) A 256 x 256 image in the pixel domain. (b) The representation of the same image in the wavelet domain. Here the intensities correspond to wavelet coefficient magnitudes.

Here, compressed sensing can be used as a new data acquisition framework to overcome the inefficiencies of the classical image compression approach [216]. In contrast to the classical approach, which involves sensing a high-resolution signal and then compressing it by throwing away part of the sensed data, compressed sensing attempts to develop methods to sense signals directly into compressed form [53].

To see how compressed sensing works for image compression, let $\Psi$ denote the $N \times N$ wavelet transform matrix. Also let $l$ denote an $N$-dimensional image. Then the vector $\alpha^* = \Psi l$ is the wavelet representation of the image and is approximately $k$-sparse (with $k \ll N$).

Let $\Phi$ be an $M \times N$ sensing matrix (with $M \approx k \ll N$), and let $A = \Phi\Psi$. The measurement matrix $A$ is obtained by combining the sensing matrix $\Phi$ and the wavelet transform matrix $\Psi$. In standard image compression, the compressed vector $f$ is obtained by finding the $k$ largest wavelet coefficients of the image (i.e. $f = H_k(\Psi l)$). This approach requires $O(N^2)$ operations. In contrast, in compressed sensing we use the $M \times N$ matrix $A$ to compress the image. That is

$$f = A l = \Phi\Psi l = \Phi\Psi\Psi^{-1}\alpha^* = \Phi\alpha^*. \qquad (4.1.1)$$

Therefore, if the measurement matrix $A$ is precomputed, the encoding $f = A l$ can be performed efficiently using $O(MN)$ operations.

Figure 4.2: Resulting images when only the largest 1%, 3%, or 10% of the db2 coefficients are used. (a) Image obtained by using the 1% largest coefficients. (b) Image obtained by using the 3% largest coefficients. (c) Image obtained by using the 10% largest coefficients.

A digital camera usually has limited computational resources. However, the image recovery is usually done once the camera is connected to a computer that has more powerful computational resources. Therefore, given the measurement vector $f$, and the prior information that $\alpha^*$ is (approximately) $k$-sparse, sparse approximation algorithms can be used to find a sparse approximation $\hat{\alpha}$ for $\alpha^*$. Subsequently, the inverse wavelet transform can find an approximation $\hat{l}$ for the image $l$ in the pixel domain. Since the wavelet transform is unitary,

$$\|l - \hat{l}\|_2 = \|\Psi^{-1}(\alpha^* - \hat{\alpha})\|_2 = \|\alpha^* - \hat{\alpha}\|_2.$$

The sparse approximation error in the pixel domain is the same as the error in the wavelet domain.
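The classical compress-by-thresholding pipeline and the norm identity above can be checked numerically with a small sketch. The example below is an illustration under stated assumptions: it uses a one-dimensional orthonormal Haar transform as a stand-in for the wavelet transform (the figures in this chapter use db2 wavelets), a short piecewise-constant signal in place of an image, and function names invented for the example. It keeps the $k$ largest coefficients, inverts the transform, and compares the error in the two domains.

```python
import math

def haar(x):
    """Orthonormal Haar wavelet transform of a signal of length 2**n."""
    a = list(x)
    out = []
    s = 1.0 / math.sqrt(2.0)
    while len(a) > 1:
        half = len(a) // 2
        avg = [s * (a[2 * i] + a[2 * i + 1]) for i in range(half)]
        det = [s * (a[2 * i] - a[2 * i + 1]) for i in range(half)]
        out = det + out          # coarser details are placed before finer ones
        a = avg
    return a + out

def inverse_haar(c):
    """Inverse of `haar`: rebuild the signal level by level."""
    a = [c[0]]
    pos = 1
    s = 1.0 / math.sqrt(2.0)
    while pos < len(c):
        det = c[pos:pos + len(a)]
        nxt = []
        for avg_i, det_i in zip(a, det):
            nxt.append(s * (avg_i + det_i))
            nxt.append(s * (avg_i - det_i))
        pos += len(a)
        a = nxt
    return a

def keep_k_largest(c, k):
    """Hard threshold H_k: keep the k largest-magnitude entries, zero the rest."""
    order = sorted(range(len(c)), key=lambda i: abs(c[i]), reverse=True)
    keep = set(order[:k])
    return [c[i] if i in keep else 0.0 for i in range(len(c))]

def l2(v):
    return math.sqrt(sum(t * t for t in v))

# A piecewise-constant "image row" with a single edge (illustrative data).
signal = [4.0] * 5 + [1.0] * 3
alpha = haar(signal)                      # wavelet representation
alpha_k = keep_k_largest(alpha, k=3)      # sparsified wavelet vector
recon = inverse_haar(alpha_k)             # decoded signal

err_pixel = l2([a - b for a, b in zip(signal, recon)])
err_wavelet = l2([a - b for a, b in zip(alpha, alpha_k)])
print(err_pixel, err_wavelet)
```

Because the Haar transform used here is orthonormal, the two printed errors agree up to floating-point rounding, which is the unitary-invariance argument of the text in miniature.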

4.1.2  Single-Pixel  Camera 

Compressed sensing addresses the computational issue central to classic image compression. However, it still first measures the whole image in the pixel domain using $N$ sensors and then performs the compression. To overcome this final challenge, new hardware, called the single-pixel camera, was developed at Rice University [232, 102].

The camera uses a small array of chip mirrors, each mirror corresponding to a pixel of the image. These mirrors can be independently rotated to either reflect the light towards a lens (on state) or away from it (off state). The mirrors can turn on and off very quickly, and thus one pixel can be partially reflected as determined by the ratio between the on and off time. A photodiode is then placed at the focal point of the lens to convert the accumulated light intensity into a quantitative measurement. By repeating this process $M$ times, we can sense $M$ linear measurements

$$f = \Phi\alpha^*.$$

In summary, in the single-pixel camera, the linear measurements are performed efficiently by nature, and only one sensor (photodiode) is required for the whole procedure. This is particularly advantageous if the arrays of $N$ high-resolution sensors (as used in classical digital cameras) are too expensive or even not available, for instance in infrared imaging.

4.1.3  Biomedical  Imaging 

Another important application for compressed sensing is in reducing the sampling rate in magnetic resonance imaging (MRI) [132]. Traditional MRI scanners sequentially sample Fourier coefficients of the human brain's image. Unfortunately, this traditional MRI approach is very time consuming, as the speed of data collection is limited by physical and physiological constraints. However, most MRI images are sparse in the Fourier domain. As a result, compressed sensing can be used to significantly decrease the number of measurements without reducing the accuracy of the MRI image [181].


4.2  Data  Streaming 

In data streaming applications, devices with limited memory process massive streams of data [196, 8]. For instance, in a network with $2^{32}$ addresses, a monitoring table counts the number of packets going from each source address to each destination address. The monitoring table is therefore a $2^{32} \times 2^{32}$ table, and the entry at row $i$ and column $j$ of the table shows the number of packets going from the source address $i$ to the destination address $j$.

Storing the whole table requires memory for $2^{64}$ entries and is not practically feasible. However, this monitoring table is often approximately sparse. There are a few source/destination pairs with a significant number of packets, whereas most pairs communicate no or very few packets. If we are interested in the most traffic-heavy pairs, our aim is to obtain (approximately) the heaviest elements of the table.

This problem can be viewed as a compressed sensing application if we represent the monitoring table as a high-dimensional vector $\alpha^* \in \mathbb{R}^N$ (e.g. $N = 2^{64}$). The goal is then to design an efficient $M \times N$ matrix $\Phi$ (with $k \approx M \ll N$), such that for every table $\alpha^*$, the $M$-dimensional sketch $f = \Phi\alpha^*$ captures most information regarding the significant entries of $\alpha^*$. That is, our aim is to obtain a $k$-sparse approximation to $\alpha^*$ from the sketch vector $f = \Phi\alpha^*$.

It is easy to see that the encoding can be done efficiently in real time thanks to the
linearity of the update operation. Let $\alpha^{*,t}$ denote the monitoring table at time t, for
which we only have access to the sketch $f^t = \Phi\alpha^{*,t}$. Also, let $\Delta^t \in \mathbb{R}^N$ be the vector
that contains the number of packets that have arrived in the interval $[t, t+1)$. The
sketch of the new table $\alpha^{*,t+1} = \alpha^{*,t} + \Delta^t$ is
$\Phi\alpha^{*,t+1} = \Phi(\alpha^{*,t} + \Delta^t) = \Phi\alpha^{*,t} + \Phi\Delta^t$.
Thus we can directly update the sketch by calculating $\Phi\Delta^t$. At the end of the day,
given the final sketch $\Phi\alpha^*$, one can use efficient sparse approximation algorithms
to find a k-sparse monitoring table $\hat{\alpha}$ that closely approximates the true monitoring
table $\alpha^*$.
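
The linearity argument above can be illustrated with a minimal Python sketch. The function names `sketch` and `update_sketch`, the small 0/1 matrix, and the dictionary encoding of the increment $\Delta^t$ are illustrative assumptions, not part of the thesis; the point is only that adding $\Phi\Delta^t$ to the stored sketch gives the same result as sketching the updated table from scratch.

```python
def sketch(Phi, table):
    """Compute the sketch f = Phi * table for a dense table vector."""
    return [sum(row[j] * table[j] for j in range(len(table))) for row in Phi]


def update_sketch(Phi, f, delta):
    """Update a stored sketch with a sparse increment.

    delta maps an index j to the number of new packets for entry j
    (a hypothetical encoding of the vector Delta^t). Only the columns of
    Phi indexed by delta are touched, so the full table is never stored.
    """
    return [f[m] + sum(Phi[m][j] * c for j, c in delta.items())
            for m in range(len(Phi))]


# Illustrative 2 x 4 sensing matrix and table (values chosen arbitrarily).
Phi = [[1, 0, 1, 1],
       [0, 1, 1, 0]]
table = [5, 0, 2, 0]
f = sketch(Phi, table)

delta = {1: 3, 3: 4}                      # packets arriving in [t, t+1)
f_next = update_sketch(Phi, f, delta)

new_table = list(table)
for j, c in delta.items():
    new_table[j] += c
print(f_next == sketch(Phi, new_table))   # linearity: both routes agree
```

The cost of an update depends only on the number of nonzero entries of the increment and on M, which is why the sketch can be maintained in real time.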


4.3  Digital  Communications 

The problem of configuring wireless networks to enable communication in
the presence of interference is one of the major challenges facing communication research
[245]. One important case is managing interference in peer-to-peer networks and in an
uplink where multiple sensors look to communicate with an access point. Another is
interference mitigation for downlink communications, in which a single transmitter
(e.g. a cellular base station) communicates simultaneously with multiple (N) receivers
[2, 10].

The key idea connecting compressed sensing to wireless communication is that at
each time only a small number ($k \ll N$) of receivers are active. The sender then
maintains an $M \times N$ sensing matrix $\Phi$, such that the ith column of $\Phi$ is associated
with the ith user.

At each transmission time, the transmitted signal is constructed as the sum of individual
signals, each intended for a different receiver. That is, the transmitted signal is
a superposition of at most k columns of the matrix. With this strategy, each receiver
can invoke sparse reconstruction algorithms and decode its own information.


4.4  Group  Testing 

Group testing is the problem of devising tests to efficiently identify members of a
group with a certain property [101, 74]. Applications of group testing range from
the blood testing problem, which was used in World War II for identifying men who
carry a certain disease [100], to the problem of testing the impacts of new drugs on
human genes [83, 164].

In group testing the aim is to avoid individual testing of all candidates by repeatedly
pooling a subgroup of multiple individuals and testing this subgroup instead [74].
It is often assumed that only a few people share the specified property,
and the goal is to design an $M \times N$ test matrix describing the M subgroup tests, so
that it is possible to efficiently recover the sparse set of special members of the group from
the tests [124] (see also [122, 175]).
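
As a small illustration of pooled testing, the following Python sketch uses a hand-built $4 \times 6$ test matrix and a simple elimination decoder. Both the matrix and the decoder are assumptions made for this example, not constructions from the cited references: each item is placed in a distinct pair of pools, and any item that appears in a negative pool is cleared. Under the assumption of exactly one special member and noiseless tests, 4 pooled tests identify it among 6 candidates.

```python
def run_tests(pools, special):
    """Outcome of each pooled test: positive iff the pool contains a special item."""
    return [any(row[j] and j in special for j in range(len(row))) for row in pools]


def decode(pools, outcomes):
    """Elimination decoder: clear every item that appears in a negative pool."""
    n = len(pools[0])
    cleared = set()
    for row, positive in zip(pools, outcomes):
        if not positive:
            cleared.update(j for j in range(n) if row[j])
    return set(range(n)) - cleared


# Each column (item) belongs to a distinct pair of the four pools.
POOLS = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

outcomes = run_tests(POOLS, {4})
print(outcomes, decode(POOLS, outcomes))
```

With more than one special member this particular matrix can return extra candidates; designing matrices that stay accurate for k special members is exactly the sparse recovery question discussed above.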

4.5  Machine  Learning


Compressed sensing has recently been used for solving the face classification problem,
in which the goal is to identify the person shown in a new (test) face image, given a
large collection of labeled training faces [256].

A key assumption in face classification is that the faces of most human beings lie in a
low-dimensional subspace, and that far fewer degrees of freedom (compared to the total
number of pixels) govern the structure of all possible faces [246, 254]. As a result,
given a sufficiently rich training set of faces for a particular person, any new (test) face
can be represented by a linear combination of that person's training faces, and therefore by a
sparse linear combination of all training faces of all people in the training repository.

Therefore, sparse approximation can be used to identify the person whose training
faces make the largest contribution to approximating the test face [255]. Using compressed
sensing and sparse approximation has provided significant improvements over
the existing state-of-the-art methods that use support vector machines [250] or principal
component analysis [170].

A similar approach has been used in speaker identification and speech recognition
applications [260, 165, 225]. Here, using compressed sensing and sparse approximation
has improved classification accuracy on the standard datasets after more
than 20 years [220].

Compressed sensing has also been used for efficiently solving multi-label classification
problems with a large label space of size N [151]. It has been shown both theoretically
and experimentally that under the reasonable assumption that each example has at
most $k \ll N$ associated labels, the compressed sensing approach is more efficient and
robust compared to other multi-label classification approaches.

Dictionary learning is another machine learning application in which using compressed
sensing is advantageous. Dictionary learning is a powerful tool in machine learning
with applications in source separation in music [236], object recognition in computer
vision [235], and image denoising in digital image processing [184]. In dictionary
learning, the $M \times N$ matrix $\Phi$ (also called the overcomplete dictionary) is learned from
the available training examples. It has been shown that one approach for solving the
dictionary learning problem is to solve a series of non-convex optimization problems
iteratively, where each non-convex optimization consists of solving a sparse coding
problem followed by a convex optimization problem [103, 104]. Devising efficient
sparse approximation algorithms for sparse coding can facilitate the task of learning
overcomplete dictionaries.

4.6  Quantum  Computing


A major obstacle to engineering quantum devices such as quantum computers has been
the lack of an effective scheme for noise characterization in many-component systems.
In contrast to a classical system, the number of parameters required to represent the
state of a quantum system grows exponentially with the number of its components.
As a result, the number of measurements needed for full characterization of the noisy
dynamics of a quantum system becomes astronomically large.

In [226], Shabani et al. have developed a CS theory to estimate the effect of noise
on the dynamics of a quantum system. They show a linear relation, $f = \Phi\alpha^*$, between the
parameters of the noisy quantum dynamics, $\alpha^*$, and the measurement outcomes, $f$. The
sparsity assumed for the signal $\alpha^*$ holds under some common physical
conditions.

Chapter  5
Thesis  Outline 


5.1  Thesis  Statement 

In this thesis, we shall see that our proposed deterministic sensing framework is
significantly more powerful for many practical applications, compared to the conventional
"random projections followed by $\ell_1$-minimization" framework used in compressed
sensing.


5.2  Main  Contributions 

In this section we outline the main contributions of the thesis. We also provide references
to the papers that cover the main material of each chapter. The central
objective of this thesis is to provide efficient deterministic sensing frameworks that
avoid the performance, storage and computational limitations of the random sensing
framework. Towards this end, we will first introduce efficient and generic recovery algorithms
that do not rely on non-verifiable properties, such as the restricted isometry
property. We focus on two important complementary tasks: sparse approximation
and model selection. In sparse approximation the goal is to find a sparse vector sufficiently
close to the sparse target vector in some metric, whereas in model selection
the objective is to recover the support of the sparse target vector in the presence of
noise.

The first half of this thesis focuses on the sparse approximation problem. In Chapter 6,
we will show that the sparse approximation problem can always be reformulated
as a zero-sum game. Then, in Chapter 7, we will introduce the Bregman
divergence as a generalization of the Euclidean distance. We will also propose and
analyze an efficient algorithm, called the GAME algorithm, that approximately solves
the sparse approximation problem by simulating a repeated game between the two
players of the zero-sum game. The algorithm is generic and does not rely on any
non-verifiable assumption regarding the sensing matrix $\Phi$ [158].

In Chapter 8, we will introduce expander-based compressed sensing as our first
deterministic sensing framework. We will also propose an efficient recovery algorithm
capable of recovering any k-sparse vector in at most 2k simple iterations in the
noiseless setting [159]. In Chapter 9 we focus on the bounded $\ell_1$-norm noise model,
and show that in that model, if an expander-based sensing matrix is used, then
it is possible to significantly tighten the generic bounds of the GAME algorithm.
Empirical results support the fidelity of the GAME algorithm [156].

In Chapter 10, we consider the problem of expander-based compressed sensing in
the presence of Poisson noise. Poisson noise is an important noise model in applications
such as low-light imaging and data streaming. We will show that in the
expander-based compressed sensing framework, a Bayesian reconstruction algorithm
can provably recover a close approximation to any sparse target vector. This means
that in the Poisson noise model, expander-based compressed sensing not only provides
storage and computational advantages over dense random sensing, but
also gains sparse approximation guarantees that are not directly obtainable in the
dense sensing framework [211, 212, 157].

The second half of the thesis investigates the model-selection problem. In Chapter 11,
we will introduce two fundamental measures of coherence between the columns of a
sensing matrix. We will further show that as long as the sensing matrix satisfies a
verifiable coherence property, a simple and efficient One-Step Thresholding (OST) algorithm
is capable of finding the support of most sparse vectors [17, 14, 16].

Reed-Muller sensing is the second proposed deterministic sensing framework. Chapter 12
introduces the Delsarte-Goethals (DG) frames, a family of deterministic sensing
matrices with optimal measures of coherence. The Delsarte-Goethals frames are generated
from the Delsarte-Goethals codes, which are a properly chosen subset of the second
order Reed-Muller codes. We will show how the coherence-optimality of the
DG frames relates to model-selection optimality of the OST algorithm in the Reed-Muller
sensing framework. To demonstrate the efficiency of the OST algorithm, we
also show that in our C++ implementation, it only takes about one minute for the
OST algorithm to recover sparse 232 dimensional vectors from 212 DG frame-based
measurements [44, 45, 46, 47, 174].

Finally, in Chapter 13 we describe the model-based compressed sensing problem. In
model-based compressed sensing, some extra prior knowledge (e.g., positivity, block
sparsity, etc.) is also available about the target sparse vector. In this setting, we will
introduce an iterative algorithm, called the NIHT algorithm, which can incorporate
the available extra prior knowledge and approximately solve the model-based sparse
approximation problem. The NIHT algorithm can be considered a generalization
of the OST algorithm, as the OST algorithm is equivalent to the NIHT algorithm
run for only one iteration. We will provide several different experiments to show that
NIHT can empirically outperform $\ell_1$-minimization methods in different compressed
sensing settings [69]. Figure 5.1 summarizes the main contributions of this thesis.

Figure 5.1: The deterministic sensing map: summary of the main contributions of
this thesis.


Part  II 

Sparse  Approximation  for 
Compressed  Sensing 




Chapter  6 


Game  Theory  Meets  Compressed 
Sensing 

6.1  Game  Theoretic  Reformulation  of  Sparse  Approximation

Sparse approximation is a fundamental problem in compressed sensing (see Section 3.1),
as well as in many other signal processing and machine learning applications
including variable selection in regression [233, 247, 193], graphical model selection
[213, 191], and sparse principal components analysis [209, 162]. In sparse approximation,
one is provided with a dimensionality reducing measurement matrix $\Phi \in \mathbb{R}^{M \times N}$
($M < N$), and a low dimensional vector $f \in \mathbb{R}^M$. The goal is to find a sparse vector
$\hat{\alpha}$ such that $\Phi\hat{\alpha}$ is sufficiently close to $f$.

In this chapter, we consider the sparse approximation problem in the $\ell_q$ norm, where
q is a positive integer. Let k be a positive integer, and let $\tau$ be an arbitrary positive
number. Define

\[
A(\tau) = \{\alpha \in \mathbb{R}^N : \|\alpha\|_1 \le \tau\}, \tag{6.1.1}
\]

as the set of all vectors inside the hyper-diamond of radius $\tau$, and define

\[
A(k, \tau) = \{\alpha \in \mathbb{R}^N : \|\alpha\|_0 \le k \text{ and } \|\alpha\|_1 \le \tau\}, \tag{6.1.2}
\]

as the set of all k-sparse vectors in $A(\tau)$.
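
The two sets can be made concrete with a short Python sketch. The helper names `in_ball` and `in_sparse_ball` are hypothetical and chosen only for this illustration. The example also checks a fact used later in the chapter: the midpoint of two 1-sparse vectors of the hyper-diamond stays in the hyper-diamond but is no longer 1-sparse, so the k-sparse set is not convex.

```python
def l1_norm(alpha):
    """ell_1 norm of a vector given as a list of floats."""
    return sum(abs(a) for a in alpha)


def in_ball(alpha, tau):
    """Membership in the hyper-diamond of radius tau (Equation (6.1.1))."""
    return l1_norm(alpha) <= tau


def in_sparse_ball(alpha, k, tau):
    """Membership in the set of k-sparse vectors of the hyper-diamond (Equation (6.1.2))."""
    nonzeros = sum(1 for a in alpha if a != 0)
    return nonzeros <= k and in_ball(alpha, tau)


u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]
midpoint = [(a + b) / 2 for a, b in zip(u, v)]

print(in_sparse_ball(u, 1, 1.0), in_sparse_ball(v, 1, 1.0))      # both 1-sparse members
print(in_ball(midpoint, 1.0), in_sparse_ball(midpoint, 1, 1.0))  # in the diamond, not 1-sparse
```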

We shall prove that for every dimension reducing matrix $\Phi$, and every measurement
vector $f$, one can find a vector $\hat{\alpha} \in A(k, \tau)$ with

\[
\|\Phi\hat{\alpha} - f\|_q \le \min_{\alpha \in A(k,\tau)} \|\Phi\alpha - f\|_q + O\!\left(\frac{1}{\sqrt{k}}\right). \tag{6.1.3}
\]

This sparse approximation framework works for any matrix $\Phi$, and not just for matrices
satisfying the RIP. Later on, in Chapter 9, we will see that if $\Phi$ is a deterministic
matrix constructed from expander graphs, then the bounds provided in this chapter
can be further tightened to an $\ell_1/\ell_1$ data-domain sparse approximation guarantee.

Note that since $A(k, \tau)$ is not convex, the optimization problem

\[
\operatorname*{minimize}_{\alpha \in A(k,\tau)} \; \|\Phi\alpha - f\|_q \tag{6.1.4}
\]

is not a convex optimization problem. This optimization problem is actually NP-hard in
general [197], so no efficient exact algorithm for it is known. However, in this chapter we will show
that there exist efficient algorithms that can provide an approximate solution.

We reformulate this sparse-approximation problem as a zero-sum game, and then
propose a computationally efficient algorithm to obtain a sparse approximation for
the optimal game solution. The proposed algorithm employs a primal-dual scheme,
and requires $O(k)$ iterations in order to find a k-sparse vector with $O(k^{-0.5})$ additive
approximation error.

We start by defining a zero-sum game and then proving that the sparse approximation
problem of Equation (6.1.4) can be reformulated as a zero-sum game.

Definition 6.1 (Zero-sum games [207]). Let $\mathcal{A}$ and $\mathcal{B}$ be two closed sets. Let
$\mathcal{L} : \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ be a function. The value of a zero-sum game with domains $\mathcal{A}$ and $\mathcal{B}$
with respect to the function $\mathcal{L}$ is defined as

\[
\min_{a \in \mathcal{A}} \max_{b \in \mathcal{B}} \mathcal{L}(a, b). \tag{6.1.5}
\]

The function $\mathcal{L}$ is usually called the loss function. A zero-sum game can be viewed as
a game between two players, Mindy and Max, in the following way. First, Mindy finds
a vector a, and then Max finds a vector b. The loss that Mindy suffers¹ is $\mathcal{L}(a, b)$.
The game-value of a zero-sum game is then the loss that Mindy suffers if both Mindy
and Max play with their optimal strategies.

Von Neumann's well-known Minimax Theorem [206, 116] states that if both $\mathcal{A}$ and
$\mathcal{B}$ are convex compact sets, and if the loss function $\mathcal{L}(a, b)$ is convex with respect to
a, and concave with respect to b, then the game-value is independent of the ordering
of the game players.

Theorem 6.2 (Von Neumann's Minimax Theorem [206]). Let $\mathcal{A}$ and $\mathcal{B}$ be compact
convex sets, and let $\mathcal{L} : \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ be a continuous function which is convex with respect to its
first argument, and concave with respect to its second argument. Then

\[
\inf_{a \in \mathcal{A}} \sup_{b \in \mathcal{B}} \mathcal{L}(a, b) = \sup_{b \in \mathcal{B}} \inf_{a \in \mathcal{A}} \mathcal{L}(a, b).
\]

For the history of the Minimax Theorem see [173]. The Minimax Theorem tells us
that for a large class of functions $\mathcal{L}$, the value of the min-max game in which Mindy
goes first is identical to the value of the max-min game in which Max starts the game.
The proof of the Minimax Theorem is provided in [117].

¹ This loss is equal to the gain that Max obtains, as the game is zero-sum.
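
To make the role of convexity concrete, the following Python sketch evaluates a toy bilinear game on a grid. The loss $(2x-1)(2y-1)$ on $[0,1] \times [0,1]$ is an illustrative choice made here, not an example from the thesis. On the convex interval the min-max and max-min values coincide, while restricting both players to the non-convex set $\{0, 1\}$ opens a gap. This is the same obstacle that the non-convex sparse set creates for the sparse approximation game.

```python
def loss(x, y):
    """Toy bilinear loss: linear in x for fixed y and linear in y for fixed x."""
    return (2 * x - 1) * (2 * y - 1)


def min_max(xs, ys):
    """Value when the minimizing player moves first."""
    return min(max(loss(x, y) for y in ys) for x in xs)


def max_min(xs, ys):
    """Value when the maximizing player moves first."""
    return max(min(loss(x, y) for x in xs) for y in ys)


grid = [i / 100 for i in range(101)]   # discretization of the convex set [0, 1]
pure = [0.0, 1.0]                      # non-convex strategy set {0, 1}

print(min_max(grid, grid), max_min(grid, grid))   # the two orderings agree
print(min_max(pure, pure), max_min(pure, pure))   # the two orderings differ
```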

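Having defined a zero-sum game and stated the Von Neumann Minimax Theorem, we next
show how the sparse approximation problem of Equation (6.1.4) can be reformulated
as a zero-sum game. Let $p = \frac{q}{q-1}$ (so that $\frac{1}{p} + \frac{1}{q} = 1$, with $p = \infty$ when $q = 1$), and define

\[
\Xi_p = \{P \in \mathbb{R}^M : \|P\|_p \le 1\}. \tag{6.1.6}
\]

Define the loss function $\mathcal{L} : \Xi_p \times A(\tau) \to \mathbb{R}$ as

\[
\mathcal{L}(P, \alpha) = \langle P, (\Phi\alpha - f) \rangle. \tag{6.1.7}
\]

Observe that the loss function is bilinear. Now it follows from the Hölder inequality
(Theorem 2.1) that for every $\alpha$ in $A(\tau)$, and for every $P$ in $\Xi_p$,

\[
\mathcal{L}(P, \alpha) = \langle P, (\Phi\alpha - f) \rangle \le \|P\|_p \|\Phi\alpha - f\|_q \le \|\Phi\alpha - f\|_q. \tag{6.1.8}
\]

The inequality of Equation (6.1.8) becomes an equality for the vector $P^*$ with entries

\[
P^*_i = \frac{\operatorname{sign}\big((\Phi\alpha - f)_i\big)\, |(\Phi\alpha - f)_i|^{q-1}}{\|\Phi\alpha - f\|_q^{\,q-1}}, \qquad i = 1, \dots, M.
\]

Therefore

\[
\max_{P \in \Xi_p} \mathcal{L}(P, \alpha) = \max_{P \in \Xi_p} \langle P, (\Phi\alpha - f) \rangle = \langle P^*, (\Phi\alpha - f) \rangle = \|\Phi\alpha - f\|_q. \tag{6.1.9}
\]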
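
The equality case of the Hölder inequality can be checked numerically. The sketch below is an illustration under stated assumptions: the residual $\Phi\alpha - f$ is represented directly by a nonzero list `x`, and the helper names `lq_norm` and `dual_vector` are hypothetical. It builds the maximizing vector entrywise as $\operatorname{sign}(x_i)|x_i|^{q-1} / \|x\|_q^{q-1}$ and confirms that its inner product with `x` equals $\|x\|_q$ while its $\ell_p$ norm, with $p = q/(q-1)$, equals one.

```python
import math


def lq_norm(x, q):
    """ell_q norm of a list of floats, for finite q >= 1."""
    return sum(abs(v) ** q for v in x) ** (1.0 / q)


def dual_vector(x, q):
    """Vector attaining equality in the Holder inequality for a nonzero x.

    Entry i is sign(x_i) * |x_i|^(q-1) / ||x||_q^(q-1).
    """
    scale = lq_norm(x, q) ** (q - 1)
    return [math.copysign(abs(v) ** (q - 1), v) / scale for v in x]


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


x = [1.0, -2.0, 2.0]          # stands in for the residual Phi*alpha - f
q = 3
p = q / (q - 1)               # conjugate exponent

P_star = dual_vector(x, q)
print(inner(P_star, x), lq_norm(x, q))   # the two numbers agree up to rounding
print(lq_norm(P_star, p))                # unit ell_p norm up to rounding
```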

Equation (6.1.9) is true for every $\alpha \in A(\tau)$. As a result, by taking the minimum over
$A(k, \tau)$ we get

\[
\min_{\alpha \in A(k,\tau)} \|\Phi\alpha - f\|_q = \min_{\alpha \in A(k,\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha).
\]

Similarly, by taking the minimum over $A(\tau)$ we get

\[
\min_{\alpha \in A(\tau)} \|\Phi\alpha - f\|_q = \min_{\alpha \in A(\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha). \tag{6.1.10}
\]

Solving the sparse approximation problem of Equation (6.1.4) is therefore equivalent
to finding the optimal strategies of the game

\[
\min_{\alpha \in A(k,\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha). \tag{6.1.11}
\]

In the next section we provide a primal-dual algorithm that approximately solves this
min-max game. Observe that since $A(k, \tau)$ is a subset of $A(\tau)$, we always have

\[
\min_{\alpha \in A(\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha) \le \min_{\alpha \in A(k,\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha),
\]

and therefore, in order to approximately solve the game of Equation (6.1.11), it is
sufficient to find $\hat{\alpha} \in A(k, \tau)$ with

\[
\max_{P \in \Xi_p} \mathcal{L}(P, \hat{\alpha}) \approx \min_{\alpha \in A(\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha). \tag{6.1.12}
\]

Chapter  7

A  Primal-Dual  Approach  for 
Sparse  Approximation 


In this chapter we provide an efficient algorithm for approximately solving the sparse
approximation problem of Equation (6.1.4). Our approximation algorithm relies heavily
on Bregman projections [65]. Therefore, before introducing the GAME algorithm,
we first provide a few important properties of Bregman projections.


7.1  Bregman  Distances  and  Projections 

Bregman divergences or Bregman distances are an important family of distances that
all share similar properties [65, 40].

Definition 7.1 (Bregman Distance). Let $\mathcal{R} : \mathcal{S} \to \mathbb{R}$ be a continuously-differentiable
real-valued and strictly convex function defined on a closed convex set $\mathcal{S}$. The Bregman
distance associated with $\mathcal{R}$ for points P and Q is:

\[
B_{\mathcal{R}}(P, Q) = \mathcal{R}(P) - \mathcal{R}(Q) - \langle (P - Q), \nabla\mathcal{R}(Q) \rangle.
\]

Intuitively, the Bregman distance measures the strictness of convexity of the function
$\mathcal{R}$, and its geometric significance is illustrated in Figure 7.1. The Bregman divergence
is the vertical distance at P between the graph of $\mathcal{R}$ and the line tangent to the graph
of $\mathcal{R}$ at Q. Table 7.1 summarizes examples of the most widely used Bregman functions
and the corresponding Bregman distances.

Note that the Bregman distance is not a metric. It is not symmetric, and it does not
satisfy the triangle inequality. However, it has several important properties that we
will use later in analyzing our sparse approximation algorithm.
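
Definition 7.1 translates directly into code. The sketch below is a minimal illustration with hypothetical helper names: it evaluates the Bregman distance from a function and its gradient, and checks two standard instances. With the squared Euclidean norm it reproduces the squared Euclidean distance, and with the function $\sum_i P_i \log P_i - \sum_i P_i$ it reproduces the generalized Kullback-Leibler divergence, which is visibly not symmetric.

```python
import math


def bregman(R, grad_R, P, Q):
    """Bregman distance R(P) - R(Q) - <P - Q, grad R(Q)> from Definition 7.1."""
    g = grad_R(Q)
    return R(P) - R(Q) - sum((p - q) * gi for p, q, gi in zip(P, Q, g))


def sq_norm(P):
    return sum(p * p for p in P)


def grad_sq_norm(P):
    return [2.0 * p for p in P]


def neg_entropy(P):
    """sum_i P_i log P_i - sum_i P_i, defined for positive entries."""
    return sum(p * math.log(p) - p for p in P)


def grad_neg_entropy(P):
    return [math.log(p) for p in P]


P = [0.2, 0.8]
Q = [0.5, 0.5]

kl_pq = bregman(neg_entropy, grad_neg_entropy, P, Q)
kl_qp = bregman(neg_entropy, grad_neg_entropy, Q, P)
print(bregman(sq_norm, grad_sq_norm, [1.0, 2.0], [3.0, 1.0]))  # squared Euclidean distance
print(kl_pq, kl_qp)                                            # the two orders differ
```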

Theorem 7.2. The Bregman distance satisfies the following properties:

•  (P1). $B_{\mathcal{R}}(P, Q) \ge 0$, and equality holds if and only if $P = Q$.

•  (P2). For every fixed Q, if we define $\mathcal{L}(P) = B_{\mathcal{R}}(P, Q)$, then

\[
\nabla\mathcal{L}(P) = \nabla\mathcal{R}(P) - \nabla\mathcal{R}(Q).
\]

•  (P3). Three point property: For every P, Q and T in $\mathcal{S}$,

\[
B_{\mathcal{R}}(P, Q) = B_{\mathcal{R}}(P, T) + B_{\mathcal{R}}(T, Q) + \langle (P - T), \nabla\mathcal{R}(T) - \nabla\mathcal{R}(Q) \rangle.
\]

•  (P4). For every $P, Q \in \mathcal{S}$,

\[
B_{\mathcal{R}}(P, Q) + B_{\mathcal{R}}(Q, P) = \langle (P - Q), (\nabla\mathcal{R}(P) - \nabla\mathcal{R}(Q)) \rangle.
\]

Proof. All four properties follow directly from Definition 7.1. □

Figure 7.1: The Bregman divergence associated with a continuously-differentiable
real-valued and strictly convex function $\mathcal{R}$ is the vertical distance at P between the
graph of $\mathcal{R}$ and the line tangent to the graph of $\mathcal{R}$ at Q.

Now  that  we  have  introduced  important  properties  of  Bregman  distances,  we  are 
ready  to  define  Bregman  projections  of  points  into  convex  sets. 

Table 7.1: Summary of the most popular Bregman functions and their corresponding
Bregman distances. Here $\Phi$ is a positive semidefinite matrix.

| Name | Bregman function $\mathcal{R}(P)$ | Bregman distance $B_{\mathcal{R}}(P, Q)$ |
|---|---|---|
| Squared Euclidean | $\lVert P \rVert_2^2$ | $\lVert P - Q \rVert_2^2$ |
| Squared Mahalanobis | $\langle P, \Phi P \rangle$ | $\langle (P - Q), \Phi (P - Q) \rangle$ |
| Kullback-Leibler | $\sum_i P_i \log P_i - \sum_i P_i$ | $\sum_i P_i \log\frac{P_i}{Q_i} - \sum_i P_i + \sum_i Q_i$ |
| Itakura-Saito | $-\sum_i \log P_i$ | $\sum_i \left( \frac{P_i}{Q_i} - \log\frac{P_i}{Q_i} - 1 \right)$ |


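Definition 7.3 (Bregman Projection). Let $\mathcal{R} : \mathcal{S} \to \mathbb{R}$ be a continuously-differentiable
real-valued and strictly convex function defined on a closed convex set $\mathcal{S}$. Let $\Omega$ be a
closed convex subset of $\mathcal{S}$. Then, for every point Q in $\mathcal{S}$, the Bregman projection of Q into
$\Omega$, denoted as $\mathcal{P}_{\Omega}(Q)$, is

\[
\mathcal{P}_{\Omega}(Q) = \arg\min_{P \in \Omega} B_{\mathcal{R}}(P, Q).
\]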


Bregman projections satisfy a generalized Pythagorean Theorem.

Theorem 7.4 (Generalized Pythagorean Theorem [65]). Let $\mathcal{R} : \mathcal{S} \to \mathbb{R}$ be a
continuously-differentiable real-valued and strictly convex function defined on a closed
convex set $\mathcal{S}$. Let $\Omega$ be a closed convex subset of $\mathcal{S}$. Then for every $P \in \Omega$ and $Q \in \mathcal{S}$,

\[
B_{\mathcal{R}}(P, Q) \ge B_{\mathcal{R}}(P, \mathcal{P}_{\Omega}(Q)) + B_{\mathcal{R}}(\mathcal{P}_{\Omega}(Q), Q), \tag{7.1.1}
\]

and in particular

\[
B_{\mathcal{R}}(P, Q) \ge B_{\mathcal{R}}(P, \mathcal{P}_{\Omega}(Q)). \tag{7.1.2}
\]

The generalized Pythagorean Theorem is illustrated in Figure 7.2. We refer the reader
to [65] or [66] for a proof of this theorem and further discussions.
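
For the squared Euclidean Bregman function the Bregman distance is the squared Euclidean distance, so the Bregman projection onto the unit $\ell_2$ ball reduces to the familiar radial rescaling. The following Python sketch, a numerical illustration with hypothetical helper names rather than a proof, checks the inequality of Equation (7.1.1) for this special case at a few points of the ball.

```python
import math


def sq_dist(P, Q):
    """Bregman distance of the squared Euclidean norm: ||P - Q||_2^2."""
    return sum((p - q) ** 2 for p, q in zip(P, Q))


def project_unit_ball(Q):
    """Euclidean (hence Bregman, for this choice of function) projection onto the unit ell_2 ball."""
    norm = math.sqrt(sum(q * q for q in Q))
    return list(Q) if norm <= 1.0 else [q / norm for q in Q]


Q = [3.0, 4.0]
proj = project_unit_ball(Q)
points_in_ball = [[0.0, 0.0], [1.0, 0.0], [0.0, -1.0], proj]

for P in points_in_ball:
    lhs = sq_dist(P, Q)
    rhs = sq_dist(P, proj) + sq_dist(proj, Q)
    print(P, lhs >= rhs - 1e-12)   # Equation (7.1.1) holds at each point
```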


7.2  GAME  Algorithm  for  Sparse  Approximation 

In this section we provide an efficient algorithm for approximately solving the problem
of sparse approximation in the $\ell_q$ norm, defined by Equation (6.1.3). Let $\mathcal{L}(P, \alpha)$ be the
loss function defined by Equation (6.1.7), and recall that in order to approximately
solve Equation (6.1.3), it is sufficient to find a sparse vector $\hat{\alpha} \in A(k, \tau)$ such that

\[
\max_{P \in \Xi_p} \mathcal{L}(P, \hat{\alpha}) \approx \min_{\alpha \in A(\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha). \tag{7.2.1}
\]

Figure 7.2: The Generalized Pythagorean Theorem states that if $\mathcal{P}_{\Omega}(Q)$ is the Bregman
projection of Q into $\Omega$, then every other point P in $\Omega$ has larger Bregman distance
to Q than to $\mathcal{P}_{\Omega}(Q)$. That is, $B_{\mathcal{R}}(P, Q) \ge B_{\mathcal{R}}(P, \mathcal{P}_{\Omega}(Q))$.

The original sparse approximation problem of Equation (6.1.3) is NP-hard, but
it is computationally feasible to compute the value of the min-max game

\[
\min_{\alpha \in A(\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha). \tag{7.2.2}
\]

The reason is that the loss function $\mathcal{L}(P, \alpha)$ of Equation (6.1.7) is a bilinear function,
and the sets $A(\tau)$ and $\Xi_p$ are both convex and closed.

Therefore, finding the game value and optimal strategies of the game of Equation (7.2.2)
is equivalent to solving a convex optimization problem and can be done
using off-the-shelf non-smooth convex optimization methods [204, 203]. However,
if an off-the-shelf convex optimization method is used, then there is no guarantee
that the recovered strategy $\hat{\alpha}$ is also sparse. We need an approximation algorithm
that finds near-optimal strategies $\hat{\alpha}$ and $\hat{P}$ for Mindy and Max with the additional
guarantee that Mindy's near-optimal strategy $\hat{\alpha}$ is sparse.

Here we introduce the Game-theoretic Approximate Matching Estimator (GAME)
algorithm, which finds a sparse approximation to the min-max optimal solution of
the game defined in Equation (7.2.2). The GAME algorithm relies on the general
primal-dual approach which was originally applied to developing strategies for repeated
games [117] (see also [144] and [135]). The pseudocode of the GAME Algorithm
is provided in Algorithm 3.

Algorithm 3 GAME Algorithm for Sparse Approximation in $\ell_q$-norm.

Inputs: M-dimensional vector $f$, $M \times N$ matrix $\Phi$, number of iterations T, sparse
approximation norm q, Bregman function $\mathcal{R}$, and regularization parameter $\eta$.
Output: N-dimensional vector $\hat{\alpha}$.

0. Find a point $Q^1 \in \mathcal{S}$ such that $\nabla\mathcal{R}(Q^1) = 0_M$, and set $P^1 = \mathcal{P}_{\Xi_p}(Q^1)$.
for t = 1, ..., T do

    1. Let $r^t = \Phi^\top P^t$. (Requires one matrix-vector multiplication.)

    2. Find the index i of one largest (in magnitude) element of $r^t$.

    3. Let $\alpha^t$ be a 1-sparse vector with
       $\mathrm{Supp}(\alpha^t) = \{i\}$, and $\alpha^t_i = -\tau\,\mathrm{Sign}(r^t_i)$.
       (Lemma 7.5: $\alpha^t = \arg\min_{\alpha \in A(\tau)} \mathcal{L}(P^t, \alpha)$.)

    4. Choose a $Q^{t+1}$ such that
       $\nabla\mathcal{R}(Q^{t+1}) = \nabla\mathcal{R}(P^t) + \eta\,(\Phi\alpha^t - f)$.

    5. Project $Q^{t+1}$ into $\Xi_p$:
       $P^{t+1} = \mathcal{P}_{\Xi_p}(Q^{t+1}) = \arg\min_{P \in \Xi_p} B_{\mathcal{R}}(P, Q^{t+1})$.

end for

6. Output $\hat{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^t$.

The GAME Algorithm can be viewed as a repeated game between two players, Mindy
and Max, who iteratively update their current strategies $\alpha^t$ and $P^t$, with the aim of
ultimately finding near-optimal strategies based on a T-round interaction with each
other. Here we briefly explain how each player updates his/her current strategy based
on the new update from the other player.

Recall that the ultimate goal is to find the solution of the game

\[
\min_{\alpha \in A(\tau)} \max_{P \in \Xi_p} \mathcal{L}(P, \alpha).
\]

At the beginning of each iteration t, Mindy receives the updated value $P^t$ from Max.
A greedy Mindy only focuses on Max's current strategy, and updates her current
strategy to $\alpha^t = \arg\min_{\alpha \in A(\tau)} \mathcal{L}(P^t, \alpha)$. In the following lemma we show that this is
indeed what our Mindy does in the first three steps of the main loop.

Lemma 7.5. Let $P^t$ denote Max's strategy at the beginning of iteration t. Let
$r^t = \Phi^\top P^t$ and let i denote the index of a largest (in magnitude) element of $r^t$. Let
$\alpha^t$ be a 1-sparse vector with $\mathrm{Supp}(\alpha^t) = \{i\}$ and with $\alpha^t_i = -\tau\,\mathrm{Sign}(r^t_i)$. Then
$\alpha^t = \arg\min_{\alpha \in A(\tau)} \mathcal{L}(P^t, \alpha)$.
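
Proof. Let $\tilde{\alpha}^t$ be any solution $\tilde{\alpha}^t = \arg\min_{\alpha \in A(\tau)} \mathcal{L}(P^t, \alpha)$. It follows from the bilinearity
of the loss function (Equation (6.1.7)) that

\[
\tilde{\alpha}^t = \arg\min_{\alpha \in A(\tau)} \mathcal{L}(P^t, \alpha) = \arg\min_{\alpha \in A(\tau)} \langle P^t, \Phi\alpha - f \rangle = \arg\min_{\alpha \in A(\tau)} \langle \Phi^\top P^t, \alpha \rangle.
\]

Hence, the Hölder inequality yields that for every $\alpha^{\#} \in A(\tau)$,

\[
\langle \Phi^\top P^t, \alpha^{\#} \rangle \ge -\|\alpha^{\#}\|_1 \|\Phi^\top P^t\|_\infty \ge -\tau \|\Phi^\top P^t\|_\infty. \tag{7.2.3}
\]

Now let $\alpha^t$ be a 1-sparse vector with $\mathrm{Supp}(\alpha^t) = \{i\}$ and $\alpha^t_i = -\tau\,\mathrm{Sign}(r^t_i)$. Then
$\alpha^t \in A(\tau)$, and

\[
\langle \Phi^\top P^t, \alpha^t \rangle = -\tau \|\Phi^\top P^t\|_\infty.
\]

In other words, for $\alpha^t$ the Hölder inequality of Equation (7.2.3) is an equality. Hence
$\alpha^t$ is a minimizer of $\langle \Phi^\top P^t, \alpha \rangle$. □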
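
Lemma 7.5 can be checked numerically. In the Python sketch below, the vector `r` stands in for $\Phi^\top P^t$, and the helper names and the random sampling of points on the boundary of the hyper-diamond are assumptions made for this illustration. The greedy 1-sparse response attains the value $-\tau \max_i |r_i|$, and no sampled point of the hyper-diamond does better.

```python
import random


def greedy_response(r, tau):
    """1-sparse best response: put -tau * sign(r_i) on a largest-magnitude entry of r."""
    i = max(range(len(r)), key=lambda j: abs(r[j]))
    alpha = [0.0] * len(r)
    alpha[i] = -tau * ((r[i] > 0) - (r[i] < 0))
    return alpha


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_point_on_diamond(n, tau, rng):
    """A random vector with ell_1 norm exactly tau (illustrative sampler)."""
    raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    scale = tau / sum(abs(v) for v in raw)
    return [v * scale for v in raw]


r = [0.5, -2.0, 1.0]
tau = 3.0
alpha = greedy_response(r, tau)
best = inner(r, alpha)

rng = random.Random(0)
samples = [random_point_on_diamond(len(r), tau, rng) for _ in range(1000)]
print(alpha, best)
print(all(inner(r, s) >= best - 1e-9 for s in samples))
```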

Thus far we have seen that at each iteration Mindy always finds a 1-sparse solution
$\alpha^t = \arg\min_{\alpha \in A(\tau)} \mathcal{L}(P^t, \alpha)$. Mindy then sends her updated strategy $\alpha^t$ to Max, and
now it is Max's turn to update his strategy. A greedy Max would prefer to update his
strategy as $P^{t+1} = \arg\max_{P \in \Xi_p} \mathcal{L}(P, \alpha^t)$. However, our Max is more conservative and
prefers to stay close to his previous value $P^t$. In other words, Max has two competing
objectives:

1.  Maximizing $\mathcal{L}(P, \alpha^t)$, or equivalently minimizing $-\mathcal{L}(P, \alpha^t)$.

2.  Remaining close to the previous strategy $P^t$, by minimizing $B_{\mathcal{R}}(P, P^t)$.

Let

\[
\mathcal{L}_{\mathcal{R}}(P) = -\eta\,\mathcal{L}(P, \alpha^t) + B_{\mathcal{R}}(P, P^t)
\]

be a regularized loss function which is a linear combination of the two objectives
above.

A conservative Max then tries to minimize a combination of the two objectives above
by minimizing the regularized loss function

\[
P^{t+1} = \arg\min_{P \in \Xi_p} \mathcal{L}_{\mathcal{R}}(P) = \arg\min_{P \in \Xi_p} \; -\eta\,\mathcal{L}(P, \alpha^t) + B_{\mathcal{R}}(P, P^t). \tag{7.2.4}
\]

Unfortunately, it is not so easy to efficiently solve the optimization problem of Equation (7.2.4)
at every iteration. To overcome this difficulty, our Max first ignores
the constraint $P^{t+1} \in \Xi_p$, and instead finds a global optimizer of $\mathcal{L}_{\mathcal{R}}(P)$ by setting
$\nabla\mathcal{L}_{\mathcal{R}}(P) = 0_M$, and then projects the result back to $\Xi_p$ via a Bregman projection.

More precisely, it follows from Property (P2) of the Bregman distance (Theorem 7.2)
that for every P

\[
\nabla\mathcal{L}_{\mathcal{R}}(P) = -\eta\,(\Phi\alpha^t - f) + \nabla\mathcal{R}(P) - \nabla\mathcal{R}(P^t),
\]

and therefore if $Q^{t+1}$ is a point with

\[
\nabla\mathcal{R}(Q^{t+1}) = \nabla\mathcal{R}(P^t) + \eta\,(\Phi\alpha^t - f),
\]

then $\nabla\mathcal{L}_{\mathcal{R}}(Q^{t+1}) = 0_M$.

The vector $Q^{t+1}$ is finally projected back to $\Xi_p$ via a Bregman projection to ensure that
Max's new strategy is in the feasible set $\Xi_p$.
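
The steps of Algorithm 3 can be sketched in Python for one concrete instantiation. The choices below are assumptions made for this illustration and are not prescribed by the thesis: $q = 2$ (so $p = 2$ and the feasible set of Max is the unit $\ell_2$ ball), the Bregman function $\frac{1}{2}\|P\|_2^2$ (whose gradient is the identity, so Step 4 becomes $Q^{t+1} = P^t + \eta(\Phi\alpha^t - f)$ and Step 5 is a radial rescaling), the convention $\mathrm{Sign}(0) = 0$, and the function name `game_l2`. The radius $\tau$ is passed explicitly.

```python
import math


def game_l2(Phi, f, tau, eta, T):
    """Sketch of the GAME iteration for q = 2 with R(P) = 0.5 * ||P||_2^2.

    Phi is a list of M rows of length N, f a list of length M.
    Returns the average of the T one-sparse strategies played by Mindy.
    """
    M, N = len(Phi), len(Phi[0])
    P = [0.0] * M                      # Step 0: grad R(Q^1) = 0 gives Q^1 = P^1 = 0
    alpha_sum = [0.0] * N
    for _ in range(T):
        # Step 1: r = Phi^T P
        r = [sum(Phi[m][n] * P[m] for m in range(M)) for n in range(N)]
        # Steps 2-3: greedy 1-sparse response of Mindy (Lemma 7.5)
        i = max(range(N), key=lambda n: abs(r[n]))
        a_i = -tau * ((r[i] > 0) - (r[i] < 0))
        alpha_sum[i] += a_i
        # Step 4: gradient step of Max; the residual uses the single nonzero of alpha^t
        residual = [Phi[m][i] * a_i - f[m] for m in range(M)]
        Q = [P[m] + eta * residual[m] for m in range(M)]
        # Step 5: projection onto the unit ell_2 ball
        norm = math.sqrt(sum(q * q for q in Q))
        P = [q / max(1.0, norm) for q in Q]
    # Step 6: average of Mindy's strategies
    return [s / T for s in alpha_sum]


Phi = [[1.0, 0.0], [0.0, 1.0]]
f = [1.0, 0.0]
alpha_hat = game_l2(Phi, f, tau=1.0, eta=0.5, T=4)
print(alpha_hat)
```

Because each round contributes a single nonzero entry of magnitude at most $\tau$, the averaged output has at most T nonzero entries and $\ell_1$ norm at most $\tau$, which is how the number of iterations controls the sparsity of the result.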


7.3  Analysis  of  the  GAME  Algorithm 


In this section we prove that the GAME algorithm finds a near-optimal solution for
the sparse approximation problem of Equation (6.1.3). The analysis of the GAME
algorithm relies heavily on the analysis of the generic primal-dual approach. This
approach originates from the link-function methodology in computational optimization
[135, 172], and is related to the mirror descent approach in the optimization
community [202, 26]. The primal-dual Bregman optimization approach is widely
used in online optimization applications including portfolio selection [80, 145], online
learning [1], and boosting [176, 76].

Let $A$ and $B$ be two convex sets, and let $\mathcal{L} : A \times B \to \mathbb{R}$ be a loss function which is convex with respect to $A$, and concave with respect to $B$. In online convex optimization, an online player chooses a point $a \in A$. After the point is chosen, an adversary chooses a point $b \in B$, and the online player receives payoff $\mathcal{L}(a, b)$. This scenario is repeated for $T$ iterations, and the goal of the online player is to minimize the regret loss [143, 262], that is, the gap between the cumulative loss actually incurred and the cumulative loss of the best fixed point in hindsight.


However, there is a major difference between the sparse approximation problem and the problem of online convex optimization. In the sparse approximation problem, the set $A = \Delta(k,\tau)$ is not convex anymore; therefore, there is no guarantee that an online convex optimization algorithm outputs a sparse strategy $\hat{\alpha}$. Hence, it is not possible to directly translate the bounds from the online convex optimization scheme to the sparse approximation scheme.

Moreover, as discussed in Lemma 7.5, there is also a major difference between the Mindy player of the GAME algorithm and the general Mindy of online convex optimization games. In the GAME algorithm Mindy is not a black-box adversary that responds with an update to her strategy based on Max's update. Here Mindy always performs a greedy update and finds the best strategy as a response to Max's update. Moreover, our Mindy always finds a 1-sparse new strategy. That is, she looks among all best responses to Max's update, and finds a 1-sparse strategy among them.

As we will see next, the combination of cooperativeness by Mindy, and standard ideas for bounding the regret in online convex optimization schemes, enables us to analyze the GAME algorithm for sparse approximation. The following theorem bounds the regret loss of the primal-dual strategy in online convex optimization problems and is proved in [144].

Theorem 7.6. Let $q$ and $T$ be positive integers, and let $p = \frac{q}{q-1}$. Suppose that $\mathcal{R}$ is such that for every $P, Q \in \mathbb{B}_p$, $B_{\mathcal{R}}(P, Q) \ge \|P - Q\|_p^2$; and let

$$G = \max_{\alpha \in \Delta(1,\tau)} \|\Phi\alpha - f\|_q. \tag{7.3.1}$$

Also assume that for every $P \in \mathbb{B}_p$, we have $B_{\mathcal{R}}(P, P^1) \le D^2$. Suppose $((P^1, \alpha^1), \dots, (P^T, \alpha^T))$ is the sequence of pairs generated by the GAME Algorithm after $T$ iterations with

d  =  gVt'  Then 


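$$\max_{P \in \mathbb{B}_p} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(P, \alpha^t) \le \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(P^t, \alpha^t) + \frac{DG}{2\sqrt{T}}.$$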


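Proof. Let $P$ be an arbitrary point in $\mathbb{B}_p$. We have

$$\begin{aligned}
\mathcal{L}(P, \alpha^t) - \mathcal{L}(P^t, \alpha^t) &= \langle P - P^t,\ \Phi\alpha^t - f \rangle \\
&\overset{(a)}{=} \frac{1}{\eta} \langle P - P^t,\ \nabla\mathcal{R}(Q^{t+1}) - \nabla\mathcal{R}(P^t) \rangle \\
&\overset{(b)}{=} \frac{1}{\eta} \left( B_{\mathcal{R}}(P, P^t) - B_{\mathcal{R}}(P, Q^{t+1}) + B_{\mathcal{R}}(P^t, Q^{t+1}) \right) \\
&\overset{(c)}{\le} \frac{1}{\eta} \left( B_{\mathcal{R}}(P, P^t) - B_{\mathcal{R}}(P, P^{t+1}) + B_{\mathcal{R}}(P^t, Q^{t+1}) \right).
\end{aligned} \tag{7.3.2}$$

Equality (a) follows from the definition of $Q^{t+1}$ (Step 4 of Algorithm 3). Equality (b) is the three point property of Bregman distances (Property (P3) of Theorem 7.2), and inequality (c) follows from the generalized Pythagorean theorem for Bregman projections (Theorem 7.4), as $P^{t+1}$ is the Bregman projection of $Q^{t+1}$ into $\mathbb{B}_p$ (Step 5 of Algorithm 3).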

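Therefore, from the telescoping trick we have

$$\begin{aligned}
\sum_{t=1}^{T} \mathcal{L}(P, \alpha^t) - \sum_{t=1}^{T} \mathcal{L}(P^t, \alpha^t) &\le \frac{1}{\eta} \left( B_{\mathcal{R}}(P, P^1) - B_{\mathcal{R}}(P, P^{T+1}) + \sum_{t=1}^{T} B_{\mathcal{R}}(P^t, Q^{t+1}) \right) \\
&\overset{(d)}{\le} \frac{D^2}{\eta} + \frac{1}{\eta} \sum_{t=1}^{T} B_{\mathcal{R}}(P^t, Q^{t+1}).
\end{aligned} \tag{7.3.3}$$

The inequality (d) follows from the facts that $B_{\mathcal{R}}(P, P^{T+1})$ is non-negative and $B_{\mathcal{R}}(P, P^1) \le D^2$. Our next step is to bound $B_{\mathcal{R}}(P^t, Q^{t+1})$. From Property (P4) of Theorem 7.2 we have

$$\begin{aligned}
B_{\mathcal{R}}(P^t, Q^{t+1}) + B_{\mathcal{R}}(Q^{t+1}, P^t) &= \langle Q^{t+1} - P^t,\ \nabla\mathcal{R}(Q^{t+1}) - \nabla\mathcal{R}(P^t) \rangle \qquad (7.3.4) \\
&= \eta\, \langle Q^{t+1} - P^t,\ \Phi\alpha^t - f \rangle. \qquad (7.3.5)
\end{aligned}$$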

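Now from Hölder's inequality we get

$$\eta\, \langle Q^{t+1} - P^t,\ \Phi\alpha^t - f \rangle \le \eta\, \|\Phi\alpha^t - f\|_q\, \|Q^{t+1} - P^t\|_p \le \frac{\eta^2}{4} G^2 + \|Q^{t+1} - P^t\|_p^2. \tag{7.3.6}$$

Thus, by plugging back Equation (7.3.6) into Equation (7.3.4), and recalling the assumption $B_{\mathcal{R}}(Q^{t+1}, P^t) \ge \|Q^{t+1} - P^t\|_p^2$, we get

$$B_{\mathcal{R}}(P^t, Q^{t+1}) \le \frac{\eta^2}{4} G^2. \tag{7.3.7}$$

Finally plugging Equation (7.3.7) into (7.3.3) and summing over all $T$ yields

$$\sum_{t=1}^{T} \mathcal{L}(P, \alpha^t) - \sum_{t=1}^{T} \mathcal{L}(P^t, \alpha^t) \le \frac{D^2}{\eta} + \frac{\eta\, T G^2}{4}. \tag{7.3.8}$$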


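Equation (7.3.8) is valid for any $\eta$ and every $P \in \mathbb{B}_p$. In particular, by setting $\eta$ as in the statement of the theorem, dividing by $T$, and taking the maximum over $P$ we get

$$\max_{P \in \mathbb{B}_p} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(P, \alpha^t) \le \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(P^t, \alpha^t) + \frac{DG}{2\sqrt{T}}. \tag{7.3.9}$$

□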


Finally we use Theorem 7.6 to show that the GAME algorithm after $T$ iterations finds a $T$-sparse vector $\hat{\alpha}$ with near-optimal value $\|\Phi\hat{\alpha} - f\|_q$.

Theorem 7.7. Let $q$ and $T$ be positive integers, and let $p = \frac{q}{q-1}$. Suppose that for every $P, Q \in \mathbb{B}_p$, the function $\mathcal{R}$ satisfies $B_{\mathcal{R}}(P, Q) \ge \|P - Q\|_p^2$, and let

$$G = \max_{\alpha \in \Delta(1,\tau)} \|\Phi\alpha - f\|_q. \tag{7.3.10}$$

Also assume that for every $P \in \mathbb{B}_p$ we have $B_{\mathcal{R}}(P, P^1) \le D^2$. Suppose $((P^1, \alpha^1), \dots, (P^T, \alpha^T))$ is the sequence of pairs generated by the GAME Algorithm after $T$ iterations with $\eta$ chosen as in Theorem 7.6. Let $\hat{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^t$ be the output of the GAME algorithm. Then $\hat{\alpha}$ is a $T$-sparse vector with $\|\hat{\alpha}\|_1 \le \tau$ and

$$\|\Phi\hat{\alpha} - f\|_q \le \min_{\alpha \in \Delta(\tau)} \|\Phi\alpha - f\|_q + \frac{DG}{2\sqrt{T}}. \tag{7.3.11}$$

Proof. It follows from Step 2 of Algorithm 3 that every $\alpha^t$ is 1-sparse and $\|\alpha^t\|_1 = \tau$. Therefore, $\hat{\alpha} = \frac{1}{T} \sum_{t=1}^{T} \alpha^t$ can have at most $T$ non-zero entries, and moreover $\|\hat{\alpha}\|_1 \le \frac{1}{T} \sum_{t=1}^{T} \|\alpha^t\|_1 \le \tau$. Therefore $\hat{\alpha}$ is in $\Delta(T, \tau)$.

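Next we show that Equation (7.3.11) holds for $\hat{\alpha}$. Let $\bar{P} = \frac{1}{T} \sum_{t=1}^{T} P^t$. Observe that

$$\begin{aligned}
\min_{\alpha \in \Delta(\tau)} \max_{P \in \mathbb{B}_p} \mathcal{L}(P, \alpha) &\overset{(e)}{=} \max_{P \in \mathbb{B}_p} \min_{\alpha \in \Delta(\tau)} \mathcal{L}(P, \alpha) \overset{(f)}{\ge} \min_{\alpha \in \Delta(\tau)} \mathcal{L}(\bar{P}, \alpha) \overset{(g)}{\ge} \frac{1}{T} \min_{\alpha \in \Delta(\tau)} \sum_{t=1}^{T} \mathcal{L}(P^t, \alpha) \\
&\overset{(h)}{\ge} \frac{1}{T} \sum_{t=1}^{T} \min_{\alpha \in \Delta(\tau)} \mathcal{L}(P^t, \alpha) \overset{(i)}{=} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(P^t, \alpha^t) \overset{(j)}{\ge} \max_{P \in \mathbb{B}_p} \frac{1}{T} \sum_{t=1}^{T} \mathcal{L}(P, \alpha^t) - \frac{DG}{2\sqrt{T}}.
\end{aligned}$$

Equality (e) is the minimax Theorem (Theorem 6.2). Inequality (f) follows from the definition of the max function. Inequalities (g) and (h) are consequences of the bilinearity of $\mathcal{L}$ and concavity of the min function. Equality (i) is valid by the definition of $\alpha^t$, and Inequality (j) follows from Theorem 7.6. As a result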


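$$\begin{aligned}
\|\Phi\hat{\alpha} - f\|_q &= \max_{P \in \mathbb{B}_p} \mathcal{L}(P, \hat{\alpha}) \\
&\le \min_{\alpha \in \Delta(\tau)} \max_{P \in \mathbb{B}_p} \mathcal{L}(P, \alpha) + \frac{DG}{2\sqrt{T}} \\
&= \min_{\alpha \in \Delta(\tau)} \|\Phi\alpha - f\|_q + \frac{DG}{2\sqrt{T}}.
\end{aligned} \tag{7.3.12}$$

□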
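The sparsity part of the argument (each round contributes one 1-sparse response, so the average of $T$ responses is at most $T$-sparse with $\ell_1$ norm at most $\tau$) can be illustrated with a small loop. This is a sketch under assumptions rather than the thesis's Algorithm 3: it fixes $p = q = 2$, uses the Euclidean regularizer $\frac{1}{2}\|P\|_2^2$, starts Max at the zero vector, and the names `mindy_best_response` and `game_sketch` are hypothetical.

```python
import math

def mindy_best_response(Phi, P, tau):
    """A 1-sparse minimizer of <P, Phi @ alpha> over the l1 ball of radius tau.

    A linear function over the l1 ball is minimized at a vertex, so it is
    enough to pick the column with the largest |<P, Phi_j>| and give it the
    opposite sign.
    """
    M, N = len(Phi), len(Phi[0])
    scores = [sum(P[i] * Phi[i][j] for i in range(M)) for j in range(N)]
    j = max(range(N), key=lambda c: abs(scores[c]))
    alpha = [0.0] * N
    alpha[j] = -tau if scores[j] >= 0 else tau
    return alpha

def game_sketch(Phi, f, tau, T, eta):
    """Run T rounds and return the average of Mindy's responses."""
    M, N = len(Phi), len(Phi[0])
    P = [0.0] * M                      # assumed initial strategy of Max
    total = [0.0] * N
    for _ in range(T):
        alpha = mindy_best_response(Phi, P, tau)
        total = [a + b for a, b in zip(total, alpha)]
        # Gradient of the bilinear loss with respect to P: Phi @ alpha - f.
        grad = [sum(Phi[i][j] * alpha[j] for j in range(N)) - f[i]
                for i in range(M)]
        # Additive step followed by projection onto the unit l2 ball.
        Q = [p + eta * g for p, g in zip(P, grad)]
        norm = math.sqrt(sum(q * q for q in Q))
        P = Q if norm <= 1.0 else [q / norm for q in Q]
    return [x / T for x in total]
```

Whatever the data, the returned vector has at most $T$ non-zero entries and $\ell_1$ norm at most $\tau$; the quality of the approximation is what Theorem 7.7 addresses, and this sketch makes no claim about it.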

Remark 7.8. In general, different choices for the Bregman function may lead to different convergence bounds with different running times to perform the new projections and updates. For instance, a multiplicative update version of the algorithm can be derived by using the Bregman divergence based on the Kullback-Leibler function, and an additive update version of the algorithm can be derived by using the Bregman divergence based on the squared Euclidean function.

Part III

Expander-Based Compressed Sensing

Chapter 8

Efficient Compressed Sensing using Optimized Expander Graphs


In  Chapter  3  we  introduced  compressed  sensing  with  the  goal  of  replacing  conventional 
sampling  with  a  more  general  combination  of  linear  measurement  and  non-linear 
reconstruction  in  order  to  acquire  certain  kinds  of  signals  at  a  rate  significantly  below 
Nyquist. 

Recall that the original approach employed Gaussian and Rademacher random matrices satisfying the RIP. However, as we discussed in Section 3.4, in order to store the whole matrix in memory we still need $O(MN)$ units of storage, which is inefficient. Also, with these matrices each matrix-vector multiplication requires $O(MN)$ operations. We then introduced the partial Fourier/Hadamard matrices as another family of random matrices satisfying the RIP. The partial Fourier/Hadamard matrices are obtained by randomly sampling rows of the Fourier/Hadamard matrix. These matrices require $O(M \log N)$ units of storage, but now the number of required measurements is suboptimal, $M = \Omega(k \log^5 N)$ compared to $M = O(k \log \frac{N}{k})$ for Gaussian and Rademacher matrices. Moreover, there is no efficient algorithm to verify whether a given random matrix satisfies the RIP or not.

In this chapter, we will introduce efficient and sparse sensing matrices that are constructed from the adjacency matrices of expander graphs. We will see that expander graphs do not have the storage and computational issues of dense random matrices, and moreover explicit constructions for such matrices exist.

The idea of using expander graphs in compressed sensing is based on the connections between coding theory and compressed sensing (see Section 3.2). In 1996, Sipser and Spielman [228] used expander graphs to build a family of linear error-correcting codes with linear decoding time complexity. These codes belong to a class of error-correcting codes called Low Density Parity Check (LDPC) codes. Later, Xu and Hassibi [258, 259] generalized the decoding of expander codes to the field of real numbers and proposed the first reconstruction algorithm for expander-based compressed sensing.

Following [258, 259], we will show how random dense matrices can be replaced by the adjacency matrices of an optimized family of expander graphs, thereby reducing the space complexity of matrix storage and the time complexity of recovery to a few simple iterations. The main idea is that we study expander graphs with expansion coefficient beyond the $\frac{3}{4}$ that was considered in [258, 259]. Our results have interesting connections with the sequential decoding of expander codes, and generalize the results of Sipser and Spielman [228], and Xu and Hassibi [258, 259].

In the remainder of this chapter, we first formally define expander graphs, and describe a few key properties that we later use in expander-based compressed sensing. In Section 8.3 we propose a combinatorial algorithm that efficiently recovers any $k$-sparse vector $\alpha^*$ from $f = \Phi\alpha^*$ after at most $2k$ simple iterations.

Our proposed algorithm generalizes the algorithm of [258, 259] to expander graphs with expansion coefficient beyond $\frac{3}{4}$. The key difference is that now the progress in each iteration is proportional to $\log N$, as opposed to a constant in [258, 259]. We then describe how the algorithm can be implemented efficiently using simple data structures.

In Section 8.4 we describe the connections between our proposed algorithm and the SMP algorithm, which was proposed subsequently by Berinde, Indyk and Ruzic [31], and can recover a sparse approximation to any vector $\alpha^* \in \mathbb{R}^N$ in the $\ell_1/\ell_1$ approximation settings of Section 3.5.


8.1  What  is  an  Expander  Graph? 

We start by defining an unbalanced bipartite vertex-expander graph [148].

Definition 8.1. Let $\mathcal{G}$ be a bipartite graph with variable (left-side) nodes $V$, check (right-side) nodes $C$, and edges $E$. We say that $\mathcal{G}$ is a $(k, \epsilon, d)$-expander if

1. $\mathcal{G}$ is left regular with left degree $d$. That is, every variable node is connected to exactly $d$ check nodes,

2. for any subset $S$ of the variable nodes $V$ with $|S| \le k$, the set of neighbors $\mathcal{N}(S)$ of $S$ has size $|\mathcal{N}(S)| \ge (1 - \epsilon) d |S|$.
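The two conditions of the definition can be checked directly on very small graphs. The sketch below is illustrative only: the representation (one list of check-node indices per variable node) and the names `random_left_regular_graph` and `is_expander` are assumptions, and the exhaustive check enumerates every subset of size at most $k$, so it is exponential in $k$ and not a practical verification procedure.

```python
import itertools
import random

def random_left_regular_graph(N, M, d, seed=0):
    """N variable nodes, each joined to d distinct check nodes out of M."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(M), d)) for _ in range(N)]

def is_expander(neighbors, k, eps):
    """Brute-force test of |N(S)| >= (1 - eps) * d * |S| for all |S| <= k."""
    d = len(neighbors[0])
    for size in range(1, k + 1):
        for S in itertools.combinations(range(len(neighbors)), size):
            nbrs = set().union(*(neighbors[v] for v in S))
            if len(nbrs) < (1 - eps) * d * size:
                return False
    return True
```

For example, a graph whose variable nodes have pairwise disjoint neighborhoods passes the test for every $\epsilon$, while two variable nodes sharing all their check nodes fail it for $k = 2$ as soon as $\epsilon < 1/2$.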

Figure 8.1: A $(k, \epsilon, d)$-expander, with $|\mathcal{N}(S)| \ge (1 - \epsilon) d |S|$. In this example, the green nodes correspond to $V$, the blue nodes correspond to $C$, the yellow oval corresponds to the set $S \subset V$, and the orange oval corresponds to the set $\mathcal{N}(S) \subset C$.

Figure 8.1 illustrates such a graph. Intuitively a bipartite graph is an expander if any sufficiently small subset of its variable nodes has a sufficiently large neighborhood. In the compressed sensing setting, $V$ (respectively $C$) will correspond to the components of the original signal (respectively its compressed representation). Hence, for a given $N = |V|$ and sparsity level $k$, an “optimized” expander should have $M = |C|$, $d$, and $\epsilon$ as small as possible, while $k$ should be as close as possible to $M$. The following proposition, proved using the probabilistic method [9, 24], is well-known in the literature on expanders:

Proposition 8.2 (Existence of optimized expanders). For any $1 \le k \le \frac{N}{2}$ and any $\epsilon \in (0, 1)$, there exists a $(k, \epsilon, d)$-expander with left degree $d = O\left(\frac{\log(N/k)}{\epsilon}\right)$ and right set size $M = O\left(\frac{k \log(N/k)}{\epsilon^2}\right)$.

As a result, by using such expander graphs we can get the optimal number of measurements $M = O\left(k \log \frac{N}{k}\right)$. Unfortunately, the problem of deterministic construction of expanders from Definition 8.1 is only solved in the special case that $k = \Omega(N)$ [61]. However, it can be shown that, with high probability, any $d$-regular random graph with

$$d = O\left(\frac{\log(N/k)}{\epsilon}\right) \quad \text{and} \quad M = O\left(\frac{k \log(N/k)}{\epsilon^2}\right)$$

satisfies the required expansion property [28]. Thus, on the one hand, it may suffice to use random bipartite regular graphs in many practical applications. On the other hand, there exists an explicit construction for a class of expander graphs that comes very close to the guarantees of Proposition 8.2. This construction, due to Guruswami, Umans, and Vadhan [139], uses Parvaresh-Vardy codes [208] and has the following guarantees:

Proposition 8.3 (Explicit construction of high-quality expanders). For any positive constant $\beta$, and any $N, k, \epsilon$, there exists a deterministic construction of a $(k, \epsilon, d)$-expander graph with

$$d = O\left(\left(\frac{\log N}{\epsilon}\right)^{(1+\beta)/\beta}\right) \quad \text{and} \quad M = O\left(d^2 k^{1+\beta}\right).$$

Next we state a few combinatorial properties of expander graphs that we will use later in our analysis.

Theorem 8.4 (Unique neighbor nodes). Let $\mathcal{G}$ be a $(k, \epsilon, d)$-expander graph. Let $S$ be a subset of the variable nodes with $|S| \le k$, and let $\mathcal{N}(S)$ denote the set of neighbors of $S$. Let $\mathcal{N}_{=1}(S)$ denote the set of those nodes in $\mathcal{N}(S)$ that are connected to a single node in $S$. Then

$$|\mathcal{N}_{=1}(S)| \ge (1 - 2\epsilon) d |S|.$$

Proof. Let $\mathcal{N}_{>1}(S)$ consist of those nodes in $\mathcal{N}(S)$ that are connected to more than a single node in $S$. First observe that any node in $\mathcal{N}(S)$ is either in $\mathcal{N}_{=1}(S)$ or in $\mathcal{N}_{>1}(S)$. Therefore, it follows from the expansion property of the graph that

$$|\mathcal{N}_{=1}(S)| + |\mathcal{N}_{>1}(S)| = |\mathcal{N}(S)| \ge (1 - \epsilon) d |S|.$$

Furthermore, since every node in $\mathcal{N}_{=1}(S)$ is connected to exactly one node in $S$, and every node in $\mathcal{N}_{>1}(S)$ is connected to more than one node in $S$, by counting the edges connecting $S$ and $\mathcal{N}(S)$, we have

$$|\mathcal{N}_{=1}(S)| + 2 |\mathcal{N}_{>1}(S)| \le d |S|.$$

The result then follows by subtracting the second inequality from twice the first. □
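The counting step of this proof holds for any left-$d$-regular bipartite graph, with the graph-wide $\epsilon$ replaced by the expansion actually achieved by the set $S$: the same two relations give $|\mathcal{N}_{=1}(S)| \ge 2|\mathcal{N}(S)| - d|S|$. A small sketch of that check follows; the adjacency-list representation and the function names are assumptions made for illustration.

```python
from collections import Counter

def unique_neighbors(neighbors, S):
    """Check nodes adjacent to exactly one variable node of S."""
    counts = Counter(c for v in S for c in neighbors[v])
    return {c for c, n in counts.items() if n == 1}

def satisfies_counting_bound(neighbors, S):
    """Test |N_{=1}(S)| >= 2 * |N(S)| - d * |S| (edge-counting argument)."""
    d = len(neighbors[0])
    all_nbrs = {c for v in S for c in neighbors[v]}
    return len(unique_neighbors(neighbors, S)) >= 2 * len(all_nbrs) - d * len(S)
```

When $|\mathcal{N}(S)| \ge (1-\epsilon) d |S|$, the right-hand side is at least $(1 - 2\epsilon) d |S|$, which recovers the statement of the theorem.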

The set $\mathcal{N}_{=1}(S)$ is called the set of unique neighbor nodes of $S$.

Corollary 8.5. Let $\mathcal{G}$ be a $(k, \epsilon, d)$-expander graph. Let $S$ be a nonempty subset of the variable nodes with $|S| \le k$. Let $\mathcal{N}_{=1}(S)$ denote the set of those nodes in $\mathcal{N}(S)$ that are connected to a single node in $S$. Then, there exists a node in $S$ that is connected to at least $(1 - 2\epsilon) d$ nodes in $\mathcal{N}_{=1}(S)$.

Proof. Assume that every node in $S$ has fewer than $(1 - 2\epsilon) d$ unique neighbors. Then $|\mathcal{N}_{=1}(S)| < (1 - 2\epsilon) d |S|$, which contradicts Theorem 8.4. □

Remark 8.6. Corollary 8.5 guarantees that every nonempty subset $S$ of variable nodes of an expander graph of size at most $k$ has at least one node with at least $(1 - 2\epsilon) d$ unique neighbors with respect to $S$. We will use this corollary in Section 8.3 in the analysis of our reconstruction algorithm.

The following theorem is a generalization of the unique neighborhood property, and proves that in every set $S$ of size at most $k$ there are many nodes with a significantly large number of unique neighbors with respect to $S$.

Theorem 8.7. Let $\mathcal{G}$ be a $(k, \epsilon, d)$-expander graph. Let $S$ be a subset of the variable nodes with $|S| \le k$, and let $\mathcal{N}(S) \subset C$ denote the set of neighbors of $S$. Let $\mathcal{N}_{=1}(S)$ denote the set of unique neighbor nodes of $S$. For every $\kappa > 2\epsilon$ define

$$S' = \{ v \in S : |\mathcal{N}(v) \cap \mathcal{N}_{=1}(S)| \ge (1 - \kappa) d \}. \tag{8.1.1}$$

Then $|S'| \ge \left(1 - \frac{2\epsilon}{\kappa}\right) |S|$.

Proof. We prove Theorem 8.7 by double-counting the size of $\mathcal{N}_{=1}(S)$. Observe that every node in $S - S'$ has at most $(1 - \kappa) d$ unique neighbors. Moreover, since the graph is $d$-regular, every node in $S'$ has at most $d$ unique neighbors. Therefore,

$$(|S| - |S'|)(1 - \kappa) d + |S'| d \ge |\mathcal{N}_{=1}(S)|.$$

Now by using Theorem 8.4 we get

$$(|S| - |S'|)(1 - \kappa) d + |S'| d \ge (1 - 2\epsilon) d |S|. \tag{8.1.2}$$

By simplifying Equation (8.1.2) we get $\kappa d |S'| \ge (\kappa - 2\epsilon) d |S|$, which is the claim. □
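The simplification at the end of this proof can be spelled out as follows. The steps only rearrange Equation (8.1.2); the numerical instance at the end is an illustration with values chosen here, not taken from the text.

```latex
\begin{align*}
(|S| - |S'|)(1-\kappa)d + |S'|d &\ge (1-2\epsilon)\,d\,|S| \\
(1-\kappa)|S| + \kappa |S'| &\ge (1-2\epsilon)|S|
    && \text{(divide by } d \text{ and collect } |S'| \text{)} \\
\kappa |S'| &\ge (\kappa - 2\epsilon)|S| \\
|S'| &\ge \Bigl(1 - \tfrac{2\epsilon}{\kappa}\Bigr)|S|
    && \text{(divide by } \kappa > 0 \text{)}
\end{align*}
% Illustration: for epsilon = 1/8 and kappa = 1/2 the bound reads
% |S'| >= |S|/2, i.e. at least half of the nodes of S have at least
% d/2 unique neighbors with respect to S.
```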

8.2  Compressed  Sensing  and  RIP-1  Property 

Let $\Phi$ be the $M \times N$ adjacency matrix of a $(k, \epsilon, d)$ expander graph. Expander-based compressed sensing uses the matrix $\Phi$ as the sensing matrix. As the expander graph is left $d$-regular, every column of $\Phi$ has exactly $d$ non-zero entries, which all have value 1. Therefore, storing the whole matrix requires only $O(dN) = O\left(N \log \frac{N}{k}\right)$ random bits. Moreover, each forward matrix-vector multiplication $u = \Phi v$ requires only $O(dN)$ operations, as every entry of $v$ updates at most $d$ entries of $u$. Similarly, the adjoint multiplication $v = \Phi^{\top} u$ also requires only $O(dN)$ operations. Therefore, sparse sensing matrices constructed from expander graphs have significant storage and computational advantages over dense Gaussian and Rademacher matrices.
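The $O(dN)$ cost of both products is easy to see when the matrix is stored as adjacency lists. The following sketch assumes a hypothetical representation in which `neighbors[j]` lists the $d$ check nodes of variable node $j$; the function names are illustrative.

```python
def forward(neighbors, v, M):
    """u = Phi v: each entry of v is added to its d check nodes."""
    u = [0.0] * M
    for j, checks in enumerate(neighbors):
        if v[j] != 0:                  # zero entries contribute nothing
            for i in checks:
                u[i] += v[j]
    return u

def adjoint(neighbors, u):
    """v = Phi^T u: each variable node sums the values of its d check nodes."""
    return [sum(u[i] for i in checks) for checks in neighbors]
```

Both loops touch each of the $dN$ edges at most once, and the forward product skips the zero entries of $v$, so for a $k$-sparse input it costs only $O(dk)$ additions after the length-$N$ scan.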

Sensing matrices based on random expander graphs of Proposition 8.2 have the extra advantage that their number of measurements is optimal, $M = O\left(k \log \frac{N}{k}\right)$. In contrast, sensing matrices based on explicit expander graphs of Proposition 8.3 have deterministic constructions.

Since every entry of $\Phi$ is either zero or one, a result of Chandar [70] states that $\Phi$ cannot satisfy the RIP property of Definition 3.5. However, a similar Restricted Isometry Property in the $\ell_1$ norm (known as the RIP-1 property) can be derived from the expansion property and will guarantee the uniqueness of sparse representation. The RIP-1 property is proved by Berinde et al. in [29].

Lemma 8.8 (RIP-1 property of the expander graphs [29]). Let $\Phi$ be the $M \times N$ adjacency matrix of a $(k, \epsilon, d)$ expander graph $\mathcal{G}$. Then for any $k$-sparse vector $\alpha \in \mathbb{R}^N$ we have:

$$(1 - 2\epsilon)\, d\, \|\alpha\|_1 \le \|\Phi\alpha\|_1 \le d\, \|\alpha\|_1. \tag{8.2.1}$$

The  full  recovery  property  now  follows  immediately  from  Lemma  8.8  and  guarantees 
that  expander-based  compressed  sensing  is  at  least  information-theoretically  possible. 

Theorem 8.9 (Full recovery). Let $m$ be a positive integer. Suppose $\Phi_{M \times N}$ is the adjacency matrix of an $((m+1)k, \epsilon, d)$ expander graph. Suppose $\alpha$ is a $k$-sparse and $\alpha'$ is an $mk$-sparse vector, such that $\Phi\alpha = \Phi\alpha'$. Then $\alpha = \alpha'$.

Proof. Let $z = \alpha - \alpha'$. We have

$$\|z\|_0 \le \|\alpha\|_0 + \|\alpha'\|_0 \le (m+1)k.$$

Now from Lemma 8.8 we have:

$$\|\alpha - \alpha'\|_1 \le \frac{1}{(1 - 2\epsilon)\, d} \|\Phi\alpha - \Phi\alpha'\|_1 = 0,$$

hence $\alpha = \alpha'$. □

Note that the proof of the above theorem essentially says that the adjacency matrix of an $((m+1)k, \epsilon, d)$ expander graph does not have a nonzero null vector that is $(m+1)k$-sparse. If $m = 2$ then Theorem 8.9 guarantees that no two $k$-sparse vectors can be mapped to the same measurement vector, and compressed sensing is information-theoretically possible. We will also give a direct proof of this result (which does not appeal to RIP-1) since it gives a flavor of the main arguments of the next section.

Lemma 8.10 (Null space of $\Phi$). Let $m$ be a positive integer. Suppose $\Phi_{M \times N}$ is the adjacency matrix of an $((m+1)k, \epsilon, d)$ expander graph with $\epsilon < \frac{1}{2}$. Then any nonzero vector in the null space of $\Phi$, i.e., any $z \ne 0$ such that $\Phi z = 0$, has more than $(m+1)k$ nonzero entries.

Proof. Define $S$ to be the support set of $z$. Suppose that $z$ has at most $(m+1)k$ nonzero entries, i.e., that $|S| \le (m+1)k$. Let $\mathcal{N}_{=1}(S)$ denote the set of those nodes in $\mathcal{N}(S)$ that are connected to a single node in $S$. Then from Theorem 8.4 we have $|\mathcal{N}_{=1}(S)| \ge (1 - 2\epsilon) d |S| > 0$. This implies that there is at least one entry of $\Phi z$ which is only connected to one entry of the support of $z$, and therefore has non-zero value. However, this contradicts the fact that $\Phi z = 0$, and so $z$ must have more than $(m+1)k$ nonzero entries. □

Table 8.1 compares expander-based sensing matrices satisfying the RIP-1 property with dense sensing matrices satisfying the RIP-2 property. Expander-based matrices are superior in terms of the number of measurements, storage, computation, and having explicit constructions. In the next sections we show that expander-based compressed sensing is also computationally possible by providing efficient and robust sparse recovery algorithms.

Table 8.1: Comparison between different properties of sensing matrices satisfying the RIP-2 property with the same properties of expander-based matrices satisfying the RIP-1 property. All bounds ignore the $O(\cdot)$ constants.

| Matrix | Number of measurements | Memory (random bits) | Matrix-vector multiplication | Random vs. Deterministic | RIP |
|---|---|---|---|---|---|
| Gaussian [21] | $k \log\frac{N}{k}$ | $kN \log\frac{N}{k}$ | $kN \log\frac{N}{k}$ | Random | RIP-2 |
| Partial Fourier [218] | $k \log^5 N$ | $k \log^6 N$ | $N \log N$ | Random | RIP-2 |
| Incoherent Frames [11, 15] | $k^2$ | - | $N \log N$ | Deterministic | RIP-2 |
| Expander Graphs [159, 29] | $k \log\frac{N}{k}$ | $N \log\frac{N}{k}$ | $N \log\frac{N}{k}$ | Random | RIP-1 |
| Expander Graphs [139] | $k^{1+\beta} (\log N)^{2(1+\beta)/\beta}$ | - | $N (\log N)^{(1+\beta)/\beta}$ | Deterministic | RIP-1 |


8.3  Efficient  Sparse  Recovery  in  the  Noiseless  Regime 

In this section we propose a simple recovery algorithm which recovers any $k$-sparse vector $\alpha^*$ from the low-dimensional vector $f = \Phi\alpha^*$ in only $2k$ simple iterations and a total running time of $O\left(N \log \frac{N}{k}\right)$. The proposed algorithm is a simple iterative algorithm that starts by setting the all-zero vector as its initial guess. At every iteration the algorithm updates only one coordinate of the guess vector by selecting a coordinate whose neighbors mostly have the same gap value (defined below). The algorithm keeps on updating the guess vector for at most $2k$ iterations. The pseudocode of our proposed algorithm is provided in Algorithm 4.

Before proving the result, we introduce some notation used in the recovery algorithm and in the proof.

In Algorithm 4 the gap is defined as follows.

Definition 8.11 (gap). Let $\alpha^*$ be the original signal and $f = \Phi\alpha^*$. Furthermore, let $\hat{\alpha}^t$ be our estimate for $\alpha^*$ after $t$ iterations of Algorithm 4. For each variable node $j$, we define the gap value $g_j^t$ as:

$$g_j^t = \mathrm{Median}\left(\{ f_i^t : i \in \mathcal{N}(j) \}\right).$$

That is, each vertex $j$ selects the entries $f_i^t$ where $i$ is a neighbor of $j$ in $\mathcal{G}$, and then computes the median $g_j^t$ of those $d$ entries.

Algorithm 4 Expander Recovery Algorithm for Sparse Signals

Inputs: An $M$-dimensional vector $f$, and the $M \times N$ matrix $\Phi$ which is the adjacency matrix of an expander graph.

Output: An $N$-dimensional vector $\hat{\alpha}$.

1: Initialize $\hat{\alpha}^1 = 0_N$ and $f^1 = f$.
2: for $t = 1, \dots, 2k$ do
3:   if $f^t = 0_M$ then
4:     output $\hat{\alpha}^t$ and exit.
5:   else
6:     For each variable node $j$, set $g_j^t = \mathrm{Median}\left(\{ f_i^t : i \in \mathcal{N}(j) \}\right)$.
7:     Find a variable node $j$ such that at least $(1 - 2\epsilon) d$ of the measurements it participates in have identical non-zero gap value $g_j^t$.
8:     Set $\hat{\alpha}_i^{t+1} = \hat{\alpha}_i^t + g_j^t$ if $i = j$, and $\hat{\alpha}_i^{t+1} = \hat{\alpha}_i^t$ otherwise.
9:     Set $f^{t+1} = f - \Phi\hat{\alpha}^{t+1}$.
10:   end if
11: end for
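A direct, unoptimized rendering of Algorithm 4 is sketched below. It is an illustration under assumptions, not a reference implementation: the graph is given as adjacency lists (`neighbors[j]` holds the check nodes of variable node $j$), values are compared exactly (so the inputs are assumed to be integers or floats whose arithmetic is exact), the first qualifying node is selected, and the function name is hypothetical.

```python
from statistics import median

def expander_recover(neighbors, f, k, eps):
    """Run at most 2k iterations of the median-gap update."""
    n, d = len(neighbors), len(neighbors[0])
    alpha = [0.0] * n
    residual = list(f)                    # f^t = f - Phi alpha^t
    threshold = (1 - 2 * eps) * d
    for _ in range(2 * k):
        if all(r == 0 for r in residual):
            break
        chosen = None
        for j, checks in enumerate(neighbors):
            gap = median(residual[i] for i in checks)
            agree = sum(1 for i in checks if residual[i] == gap)
            if gap != 0 and agree >= threshold:
                chosen = (j, gap)
                break
        if chosen is None:
            # Cannot happen under the hypotheses of the analysis; kept as a
            # guard for inputs that are not built from an expander.
            raise ValueError("no variable node qualifies")
        j, gap = chosen
        alpha[j] += gap
        for i in neighbors[j]:            # update only the d affected residuals
            residual[i] -= gap
    return alpha
```

Recomputing every median in each iteration costs $O(dN)$ per iteration, which is why the text goes on to discuss preprocessing and a priority queue.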

Definition 8.12. At each iteration $t$, $G^t$ is the support of the residual vector $f^t$:

$$G^t = \mathrm{Supp}(f^t) = \mathrm{Supp}(f - \Phi\hat{\alpha}^t);$$

similarly, $S^t$ is the support of the difference between the true (unknown) vector $\alpha^*$ and our candidate $\hat{\alpha}^t$:

$$S^t = \mathrm{Supp}(\hat{\alpha}^t - \alpha^*) = \{ j : \hat{\alpha}_j^t \ne \alpha_j^* \}.$$

Now we are ready to state the main result:

Theorem 8.13 (Expander Recovery Algorithm). Let $\Phi_{M \times N}$ be the adjacency matrix of a $(3k, \epsilon, d)$ expander graph, where $\epsilon < 1/4$. Then, for any $k$-sparse signal $\alpha^*$, given $f = \Phi\alpha^*$, the expander recovery algorithm (Algorithm 4) recovers $\alpha^*$ successfully in at most $2k$ iterations.

The proof establishes two facts:

• The algorithm never gets stuck, and one can always find a coordinate $j$ that is connected to at least $(1 - 2\epsilon) d$ check nodes with identical non-zero gaps.

• With certainty the algorithm will stop after at most $2k$ rounds. Furthermore, by choosing $\epsilon$ small enough the number of iterations can be made arbitrarily close to $k$.

We need the following lemmas to prove Theorem 8.13.

Lemma 8.14 (progress). Suppose that at iteration $t$, $|S^t| < 2k$. Then there always exists a variable node $j$ such that at least $(1 - 2\epsilon) d$ of its neighbor check nodes have the same non-zero gap value $g_j^t$.

Proof. Since $\Phi$ is the adjacency matrix of a $(2k, \epsilon, d)$ expander graph (every $(3k, \epsilon, d)$ expander is also a $(2k, \epsilon, d)$ expander), and $|S^t| < 2k$, it follows from Corollary 8.5 that there exists a coordinate $j$ in $S^t$ that is uniquely connected to at least $(1 - 2\epsilon) d$ check nodes; in other words, no other node of $S^t$ is connected to these check nodes. The residual at each of these check nodes therefore equals the non-zero error $\alpha_j^* - \hat{\alpha}_j^t$, which immediately implies the lemma. □

Lemma 8.15 (gap elimination). At each step $t$, if $|S^t| < 2k$ then $|G^{t+1}| \le |G^t| - (1 - 4\epsilon) d$.

Proof. By the previous lemma, if $|S^t| < 2k$, there always exists a node $j$ that is connected to at least $(1 - 2\epsilon) d$ check nodes with identical non-zero gap, and hence to at most $2\epsilon d$ check nodes possibly with zero gaps. Adding the gap value $g_j^t$ to the current value of this variable node sets the gaps on these uniquely connected neighbors of $j$ to zero, but it may make some zero gaps on the remaining $2\epsilon d$ neighbors non-zero. So at least $(1 - 2\epsilon) d$ coordinates of $G^t$ will become zero, and at most $2\epsilon d$ of its zero coordinates may become non-zero. Hence

$$|G^{t+1}| \le |G^t| - (1 - 2\epsilon) d + 2\epsilon d = |G^t| - (1 - 4\epsilon) d. \tag{8.3.1}$$

□

The following lemma provides a direct connection between the size of $G^t$ and the size of $S^t$.

Lemma 8.16 (connection). If at iteration $t$, $|S^t| < 2k$, then $(1 - 2\epsilon) d |S^t| \le |G^t|$.

Proof. Since $|S^t| < 2k$, by Theorem 8.4

$$|\mathcal{N}_{=1}(S^t)| \ge (1 - 2\epsilon) d |S^t|.$$

Also, each node in $\mathcal{N}_{=1}(S^t)$ has non-zero gap and so is a member of $G^t$. □

Lemma 8.17 (preservation). At each step $t \le 2k$, after running the algorithm we have $|S^{t+1}| < 2k$.

Proof. Since at each step we are only changing one coordinate of $\hat{\alpha}^t$, we have $|S^{t+1}| \le |S^t| + 1$, so we only need to prove that $|S^{t+1}| \ne 2k$.

Suppose for a contradiction that $|S^{t+1}| = 2k$, and partition $\mathcal{N}(S^{t+1})$ into two disjoint sets:

1. $\mathcal{N}_{=1}(S^{t+1})$: The vertices in $\mathcal{N}(S^{t+1})$ that are connected only to one vertex in $S^{t+1}$.

2. $\mathcal{N}_{>1}(S^{t+1})$: The other vertices (that are connected to more than one vertex in $S^{t+1}$).

The argument is similar to that given in Theorem 8.4; by double counting the number of vertices in $\mathcal{N}_{=1}(S^{t+1})$ and $\mathcal{N}_{>1}(S^{t+1})$ one can show that

$$|\mathcal{N}_{=1}(S^{t+1})| \ge (1 - 2\epsilon)\, d\, |S^{t+1}| = (1 - 2\epsilon)\, d\, 2k. \tag{8.3.2}$$

Now we have the following facts:

• $|\mathcal{N}_{=1}(S^{t+1})| \le |G^{t+1}|$: Coordinates in $\mathcal{N}_{=1}(S^{t+1})$ are connected uniquely to coordinates in $S^{t+1}$, hence each coordinate in $\mathcal{N}_{=1}(S^{t+1})$ has non-zero gap.

• $|G^{t+1}| \le |G^1|$: gap elimination from Lemma 8.15.

• $|G^1| \le kd$: since $\alpha^*$ is $k$-sparse and $\hat{\alpha}^1$ is the all-zero vector, $\hat{\alpha}^1$ and $\alpha^*$ differ in at most $k$ coordinates. Therefore, since the graph is $d$-regular, $\Phi\hat{\alpha}^1$ and $\Phi\alpha^*$ can differ in at most $kd$ coordinates.

As a result we have:

$$(1 - 2\epsilon)\, 2dk \le |\mathcal{N}_{=1}(S^{t+1})| \le |G^{t+1}| \le |G^1| \le kd. \tag{8.3.3}$$

This implies $\epsilon \ge \frac{1}{4}$, which contradicts the assumption $\epsilon < \frac{1}{4}$. □

Proof of Theorem 8.13. Preservation (Lemma 8.17) and Progress (Lemma 8.14) together immediately imply that the algorithm will never get stuck. Also, we have shown that $|G^1| \le kd$ (in the proof of Lemma 8.17) and that $|G^{t+1}| \le |G^t| - (1 - 4\epsilon) d$ (Lemma 8.15). Hence after at most $T = \frac{k}{1 - 4\epsilon}$ steps we will have $|G^T| = 0$, and this together with the Connection Lemma (Lemma 8.16) implies that $|S^T| = 0$, which is the exact recovery of the original signal.

Note that we have to choose $\epsilon < \frac{1}{4}$, and as an example, by setting $\epsilon = \frac{1}{8}$ the recovery needs at most $2k$ iterations. □
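The iteration count in this proof follows from a one-line computation, sketched here from the two bounds $|G^1| \le kd$ and $|G^{t+1}| \le |G^t| - (1-4\epsilon)d$. The second numerical instance ($\epsilon = 1/16$) is an illustration chosen here, not a value taken from the text.

```latex
\begin{align*}
T \;\le\; \frac{|G^1|}{(1-4\epsilon)\,d}
  \;\le\; \frac{kd}{(1-4\epsilon)\,d}
  \;=\; \frac{k}{1-4\epsilon},
\qquad
\epsilon = \tfrac{1}{8} \;\Rightarrow\; T \le 2k,
\qquad
\epsilon = \tfrac{1}{16} \;\Rightarrow\; T \le \tfrac{4k}{3}.
\end{align*}
% As epsilon tends to 0 the bound tends to k, and as epsilon tends to 1/4
% it blows up, which is why the analysis needs epsilon < 1/4.
```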


1.  A f=1(St+1)-. 

st+1. 

2.  A A>1(5t+1): 
St+1). 
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Algorithm  4  is  an  iterative  algorithm  that  consists  of  at  most  2k  simple  iterations. 
Each  iteration  can  be  implemented  efficiently  (see  [258]  )  since  the  adjacency  matrix 
of  the  expander  graph  is  sparse  with  all  entries  0  or  1. 

The efficiency of the algorithm can also be improved by using a priority queue data structure. The idea is to use preprocessing as follows: for each variable node $j$, compute the median of its neighbors, $g_j = \mathrm{Median}(\{f_i : i \in \mathcal{N}(j)\})$, and also compute the number of neighbors with the same value $g_j$. Note that if a node has $(1 - 2\epsilon)d$ unique neighbors, their median should also be among them. Then construct the priority queue based on the values $g_j$, and at each iteration extract the root node from the queue, perform the gap elimination on it, and then, if required, make the correction on the corresponding $dD$ variable nodes, where $d$ is the left degree and $D$ is the maximum right degree of the expander. The main computational cost of this variation of the algorithm is the cost of building the priority queue, which is $O\left(N \log \frac{N}{k}\right)$: finding the median of $d$ elements can be done in $O(d)$ time, and building a priority queue requires linear time.
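The preprocessing step can be sketched as follows. This is a minimal illustration, not the thesis implementation: the names `build_gap_queue`, `neighbors` and `gaps` are hypothetical, the queue is keyed by the number of neighbors that agree with the median (the text only says the queue is built from the values $g_j$), and nodes whose median gap is zero are skipped on the assumption that they need no elimination. `median_low` sorts its input, so this sketch spends $O(d \log d)$ per node rather than the $O(d)$ achievable with a selection algorithm.

```python
import heapq
from statistics import median_low

def build_gap_queue(neighbors, gaps):
    """Build a max-priority queue over variable nodes.

    neighbors[j] lists the measurement indices adjacent to variable node j,
    gaps[i] is the current gap of measurement i.
    """
    entries = []
    for j, nbrs in enumerate(neighbors):
        values = [gaps[i] for i in nbrs]
        g_j = median_low(values)          # always one of the neighbor values
        if g_j == 0:
            continue                      # assumed: nothing to eliminate here
        agree = values.count(g_j)         # neighbors sharing the median value
        entries.append((-agree, j, g_j))  # negate: heapq is a min-heap
    heapq.heapify(entries)                # linear-time heap construction
    return entries
```

Popping with `heapq.heappop` then returns the variable node with the largest agreement count first.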

8.4  Sparse  Matching  Pursuit 

Thus far we have introduced efficient iterative algorithms for recovery of (almost) $k$-sparse vectors. An important next step is to generalize these algorithms to provide sparse approximations for arbitrary vectors in $\mathbb{R}^N$ in the settings of Section 3.5. More precisely, let $\Phi$ be the adjacency matrix of an expander graph. The goal is to design an efficient recovery algorithm such that for every data vector $\alpha^* \in \mathbb{R}^N$ and noise vector $e_M \in \mathbb{R}^M$, given $f = \Phi\alpha^* + e_M$, the algorithm can find a sparse vector $\hat{\alpha}$ close to the best $k$-term approximation $H_k(\alpha^*)$ of $\alpha^*$.

Expander Matching Pursuit (EMP) [153] was the first expander-based recovery algorithm capable of solving the general sparse approximation problem. The key feature of the EMP algorithm is that sparse recovery requires near-linear $O\left(N \log \frac{N}{k}\right)$ operations, while still using $O\left(k \log \frac{N}{k}\right)$ measurements. Moreover, the algorithm provides $\ell_1/\ell_1$ sparse approximation guarantees. That is, for every $\alpha^* \in \mathbb{R}^N$ and $e_M \in \mathbb{R}^M$, given $f = \Phi\alpha^* + e_M$, the algorithm finds a $k$-sparse vector $\hat{\alpha}$ with


However,  the  empirical  performance  of  EMP  is  less  impressive.  The  algorithm  re¬ 
quires  M  =  5000  measurements  to  recover  random  signed  50-sparse  signals  of  length 
N  =  20000,  whereas  the  convex  optimization  method  with  random  Gaussian  matrices 
requires  only  about  M  =  450  measurements  [31,  98]. 

Sparse Matching Pursuit (SMP) [31] is an iterative message-passing algorithm designed to overcome the empirical suboptimality of the EMP algorithm. The SMP algorithm shares significant similarities with Algorithm 4 provided in Section 8.3 for recovering sparse vectors, and can be viewed as a generalization of Algorithm 4.

In each iteration, the algorithm estimates the difference between the current approximation $\alpha^t$ and the true vector $\alpha^*$ from the measurement error $\Phi\alpha^t - f$. The estimate is obtained by using the median estimator $g^t$ as in Algorithm 4. The data-domain approximation $\alpha^t$ is updated by $g^t$, and the process is repeated. The key difference is that, whereas Algorithm 4 updates only one coordinate of $\alpha^t$ at each iteration, SMP updates $2k$ coordinates of $\alpha^t$ simultaneously in a greedy manner. The pseudocode of the SMP algorithm is provided in Algorithm 5.


Algorithm 5 Sparse Matching Pursuit (SMP)

Inputs: An $M$-dimensional vector $f$, the $M \times N$ expander matrix $\Phi$, and the number of iterations $T$.

Output: An $N$-dimensional vector $\hat{\alpha}$.

1: Initialize $\alpha^1 = 0_N$ and $f^1 = f$.
2: for $t = 1, \dots, T$ do
3: For each variable node $j$, set $g^t_j = \mathrm{Median}(\{f^t_i : i \in \mathcal{N}(j)\})$.
4: Set $h^t = H_{2k}(g^t)$.
5: Set $\alpha^{t+1} = H_k(\alpha^t + h^t)$.
6: Set $f^{t+1} = f - \Phi\alpha^{t+1}$.
7: end for
8: Output $\hat{\alpha} = \alpha^{T+1}$.
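The steps of Algorithm 5 can be sketched in NumPy as below. This is an illustrative transcription under stated assumptions, not the reference SMP code of [31]: the graph is passed as a hypothetical `neighbors` array of shape $(N, d)$ whose row $j$ lists the measurements adjacent to variable node $j$, `top_k` is a helper name introduced here for $H_k$, and $2k \le N$ is assumed.

```python
import numpy as np

def top_k(x, k):
    """H_k: keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def smp(f, neighbors, N, k, T):
    """Sketch of Algorithm 5 for a left-d-regular graph given by `neighbors`."""
    f = np.asarray(f, dtype=float)
    d = neighbors.shape[1]
    alpha = np.zeros(N)                               # step 1: alpha^1 = 0
    residual = f.copy()                               # step 1: f^1 = f
    for _ in range(T):
        g = np.median(residual[neighbors], axis=1)    # step 3: neighbor medians
        h = top_k(g, 2 * k)                           # step 4
        alpha = top_k(alpha + h, k)                   # step 5
        # step 6: f - Phi @ alpha, accumulated edge by edge
        phi_alpha = np.bincount(neighbors.ravel(),
                                weights=np.repeat(alpha, d),
                                minlength=f.size)
        residual = f - phi_alpha
    return alpha
```

Storing the graph as adjacency lists keeps each iteration at $O(Nd)$ work, which is the reason sparse binary matrices are attractive here.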


The following theorem summarizes the performance guarantees of the SMP algorithm.

Theorem 8.18 ([31]). Let $\Phi$ be the adjacency matrix of an $(O(k), \epsilon, d)$ expander graph with sufficiently small $\epsilon$. Then there exists a constant $\kappa_{SMP}$ depending only on $\epsilon$, such that for any data vector $\alpha^* \in \mathbb{R}^N$ and noise vector $e_M \in \mathbb{R}^M$, given $f = \Phi\alpha^* + e_M$, the SMP algorithm after $T$ iterations finds a $k$-sparse vector $\hat{\alpha}$ with

\[ \|\hat{\alpha} - \alpha^*\|_1 \le \frac{\|\alpha^*\|_1}{2^T} + 4\kappa_{SMP}\left(\|\alpha^* - H_k(\alpha^*)\|_1 + \frac{\|e_M\|_1}{d}\right). \]

Corollary 8.19. Let $\Phi$ be the adjacency matrix of an $(O(k), \epsilon, d)$ expander graph with sufficiently small $\epsilon$. Then there exists a constant $\kappa_{SMP}$ depending only on $\epsilon$, such that for any data vector $\alpha^* \in \mathbb{R}^N$ and noise vector $e_M \in \mathbb{R}^M$, given $f = \Phi\alpha^* + e_M$, the SMP algorithm after

\[ T = \log_2\left(\frac{\|\alpha^*\|_1}{\|\alpha^* - H_k(\alpha^*)\|_1 + \frac{\|e_M\|_1}{d}}\right) \]

iterations finds a $k$-sparse vector $\hat{\alpha}$ with

\[ \|\hat{\alpha} - \alpha^*\|_1 \le (1 + 4\kappa_{SMP})\left(\|\alpha^* - H_k(\alpha^*)\|_1 + \frac{\|e_M\|_1}{d}\right). \]

Therefore, the SMP algorithm provides an $\ell_1/\ell_1$ best $k$-term approximation guarantee. The running time of the SMP algorithm is higher than that of EMP by a logarithmic factor; however, it has better empirical performance and requires significantly fewer measurements. For instance, $M = 2000$ measurements are enough to successfully recover 50-sparse signals of dimension 20000.

Empirical performance can be further improved via Sequential Sparse Matching Pursuit (SSMP) [30], a recent greedy variant of SMP. In contrast to the original SMP algorithm, SSMP performs sequential updates by always selecting the best update first. In the above setting, $M = 1400$ measurements, rather than $M = 2000$, are sufficient for successful recovery. Nevertheless, the SSMP algorithm is slower than SMP by a logarithmic factor.

Remark 8.20. Note that more efficient sparse approximation algorithms exist for the special case where the non-zero entries of the sparse signal are all positive [169, 71].

Expander-based compressed sensing with the reconstruction algorithms mentioned in this chapter provides for-all sparse approximation guarantees. This means that the deterministic (random) expander sensing matrix combined with any of the reconstruction algorithms mentioned above guarantees $\ell_1/\ell_1$ best $k$-term approximation for every vector $\alpha^* \in \mathbb{R}^N$ surely (or with high probability). In contrast, another sparse recovery setting, which evolved in the context of data streaming, provides a weaker for-each guarantee [123].

In this setting, the sensing matrix $\Phi$ is a sparse matrix chosen at random from some distribution, and for each vector $\alpha^* \in \mathbb{R}^N$ the recovery algorithm successfully finds a sparse approximation to $\alpha^*$ with probability at least $1 - \frac{1}{N}$. Examples of such algorithms, which require only $O(k \log N)$ measurements, are [79, 72], which provide instance-optimal $\ell_2/\ell_2$ guarantees in $O(N \log N)$ running time, and [125], which provides instance-optimal $\ell_1/\ell_1$ guarantees in $O(\mathrm{poly}(k, \log N))$ running time.

We note that even though these for-each algorithms provide tighter instance-optimal guarantees or are faster, they are typically not resilient to measurement noise. This makes them difficult to use in compressed sensing applications other than data streaming.
Chapter 9


A  Game  Theoretic  Approach  to 
Expander-based  Compressive 
Sensing 

9.1 RIP-1 and $\ell_1$-Minimization

In Chapter 8 we introduced expander graphs and proposed efficient algorithms for recovery of (almost) sparse signals. We also described combinatorial sparse recovery algorithms with $\ell_1/\ell_1$ best $k$-term approximation guarantees. However, these algorithms have a major drawback if the signal is not almost sparse. While they are efficient and rather easy to implement, their approximation guarantees are meaningful only in extremely large dimensions. The big-O notation in their $\ell_1/\ell_1$ guarantees hides large constants, making these algorithms suitable only for extremely high dimensions or when the SNR is very high.

On the other hand, in Section 3.3 we mentioned the $\ell_1$-minimization method as a robust sparse recovery scheme that depends on the geometry of sensing matrices satisfying the RIP in the $\ell_2$ norm. It is natural to ask whether it is possible to design stable sparse recovery algorithms, such as $\ell_1$-minimization, that rely on the geometry of expander graphs.

In this section we will see that the answer to the above question is yes: as in the RIP-2 case, the RIP-1 property of expander graphs is sufficient to guarantee that $\ell_1$-minimization methods stably recover every sparse vector.

In Lemma 8.8 we saw that the adjacency matrix of any $(k, \epsilon, d)$ expander graph almost preserves the $\ell_1$ norm of any $k$-sparse vector. This property is known as the Restricted Isometry Property for $\ell_1$ norms, or the RIP-1 property.
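The RIP-1 property can be probed numerically. The sketch below is an illustration under assumptions, not an experiment from the thesis: it draws a random left-$d$-regular binary matrix (the helper name `random_left_regular` is hypothetical), uses $N = 1000$, $M = 200$, $k = 20$ together with an assumed left degree $d = 8$, and reports the ratio $\|\Phi x\|_1 / (d\|x\|_1)$ over random signed $k$-sparse vectors. The upper end of the ratio is always at most 1 because every column has exactly $d$ ones; how close the lower end stays to 1 reflects the expansion quality of the sampled graph.

```python
import numpy as np

def random_left_regular(M, N, d, rng):
    """M x N 0/1 matrix with exactly d ones per column (random left-d-regular graph)."""
    Phi = np.zeros((M, N))
    for j in range(N):
        Phi[rng.choice(M, size=d, replace=False), j] = 1.0
    return Phi

def l1_ratio(Phi, x, d):
    """||Phi x||_1 / (d ||x||_1); RIP-1 asks this to stay in [1 - 2*eps, 1]."""
    return np.abs(Phi @ x).sum() / (d * np.abs(x).sum())

rng = np.random.default_rng(0)
M, N, d, k = 200, 1000, 8, 20
Phi = random_left_regular(M, N, d, rng)

ratios = []
for _ in range(100):
    x = np.zeros(N)
    support = rng.choice(N, size=k, replace=False)
    x[support] = rng.choice([-1.0, 1.0], size=k)   # random signs, unit magnitudes
    ratios.append(l1_ratio(Phi, x, d))

print("smallest ratio:", min(ratios), "largest ratio:", max(ratios))
```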

The following proposition is a direct consequence of the above RIP-1 property. It states that for any vector $u$, if there exists a vector $v$ whose $\ell_1$ norm is close to that of $u$, and if $\Phi v$ approximates $\Phi u$ in the $\ell_1$ norm, then $v$ also approximates $u$ in the $\ell_1$ norm.


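Theorem 9.1. Let $\Phi$ be the adjacency matrix of a $(2k, \epsilon, d)$-expander and let $u, v$ be two vectors in $\mathbb{R}^N$ such that

\[ \|u\|_1 \ge \|v\|_1 - \Delta \]

for some $\Delta \ge 0$. Then

\[ \|u - v\|_1 \le \frac{1 - 2\epsilon}{1 - 6\epsilon}\left(2\|u - H_k(u)\|_1 + \Delta\right) + \frac{2}{(1 - 6\epsilon)d}\,\|\Phi u - \Phi v\|_1. \]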


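Proof. Let $S = \mathrm{Supp}(H_k(u))$, and let $(S^1, \dots, S^t)$ be a decreasing partitioning of $\bar{S}$ (with respect to coefficient magnitudes), such that all sets but (possibly) $S^t$ have size $k$. Note that $S^0 = S$. Let $\Phi_{\mathcal{N}(S)}$ be the submatrix of $\Phi$ containing the rows from $\mathcal{N}(S)$.

Finally let $y = u - v$. Then, following the argument of Berinde et al. [29], which also appears in Sipser and Spielman [228], we have the following chain of inequalities:

\[ \begin{aligned}
\|\Phi y\|_1 \ge \|\Phi_{\mathcal{N}(S)}\, y\|_1 &\ge \|\Phi_{\mathcal{N}(S)}\, y_S\|_1 - \sum_{l=1}^{t} \sum_{\substack{j \in S^l,\ i \in \mathcal{N}(S) \\ (j,i)\ \text{edge}}} |y_j| \\
&\ge d(1 - 2\epsilon)\|y_S\|_1 - \sum_{l=1}^{t} \sum_{\substack{j \in S^l,\ i \in \mathcal{N}(S) \\ (j,i)\ \text{edge}}} |y_j| \\
&\ge d(1 - 2\epsilon)\|y_S\|_1 - 2kd\epsilon \sum_{l=1}^{t} \frac{\|y_{S^{l-1}}\|_1}{k} \ \ge\ d(1 - 2\epsilon)\|y_S\|_1 - 2d\epsilon\|y\|_1.
\end{aligned} \tag{9.1.1} \]

Hence

\[ \|\Phi y\|_1 + 2d\epsilon\|y\|_1 \ge (1 - 2\epsilon)d\,\|y_S\|_1. \tag{9.1.2} \]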

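Now, from the triangle inequality we have

\[ \begin{aligned}
\|u\|_1 \ge \|v\|_1 - \Delta &= \|v_S\|_1 + \|v_{\bar{S}}\|_1 - \Delta \\
&\ge \|u_S\|_1 - \|(u - v)_S\|_1 + \|(u - v)_{\bar{S}}\|_1 - \|u_{\bar{S}}\|_1 - \Delta \\
&= \|u\|_1 - \Delta + \|u - v\|_1 - 2\left(\|(u - v)_S\|_1 + \|u - H_k(u)\|_1\right),
\end{aligned} \tag{9.1.3} \]

where $\bar{S}$ denotes the complement of $S = \mathrm{Supp}(H_k(u))$ and the last step uses $\|u_{\bar{S}}\|_1 = \|u - H_k(u)\|_1$.

Therefore, from Equation (9.1.2), we obtain

\[ \|u\|_1 \ge \|u\|_1 - 2\|u - H_k(u)\|_1 - \Delta + \|u - v\|_1 - \frac{2\|\Phi u - \Phi v\|_1 + 4d\epsilon\|u - v\|_1}{(1 - 2\epsilon)d}. \]

Rearranging the inequality completes the proof. □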

Using Theorem 9.1, Berinde et al. have shown that the RIP-1 property is sufficient for sparse recovery using $\ell_1$ minimization [29].
Theorem 9.2. Let $\epsilon$ be a positive number smaller than $\frac{1}{6}$, and let $\Phi$ be a sensing matrix satisfying $(2k, \epsilon, d)$-RIP1. Let $\alpha^*$ be an arbitrary vector in $\mathbb{R}^N$, and let $H_k(\alpha^*)$ denote the best $k$-term approximation of $\alpha^*$ defined by Equation (2.1.1). Finally let $e_M$ be an arbitrary noise vector in $\mathbb{R}^M$, and let $f = \Phi\alpha^* + e_M$. Then the solution $\hat{\alpha}$ of the Basis Pursuit problem

\[ \text{minimize } \|\alpha\|_1 \quad \text{subject to } \|f - \Phi\alpha\|_1 \le \|e_M\|_1, \]

satisfies the following $\ell_1/\ell_1$ sparse approximation guarantee:

\[ \|\hat{\alpha} - \alpha^*\|_1 \le c_1\|\alpha^* - H_k(\alpha^*)\|_1 + c_2\|e_M\|_1, \tag{9.1.4} \]

with $c_1 = \frac{2(1 - 2\epsilon)}{1 - 6\epsilon}$ and $c_2 = \frac{4}{(1 - 6\epsilon)d}$.

The optimization-based method of Theorem 9.2 exploits the geometry of expander graphs, and performs significantly better than the combinatorial approach in practical applications. Figure 9.1 compares the performance of the geometric BP algorithm and the combinatorial SSMP algorithm. This comparison is borrowed from experiments performed by Berinde and Indyk [30], and shows that the $\ell_1$-minimization method can recover signals with significantly higher sparsity level than the SSMP algorithm. For instance, about 1000 measurements are sufficient to recover 100-sparse signals of dimension 20,000 using the BP algorithm, whereas at least 2200 measurements are required for successful recovery using the SSMP algorithm.

Even though the $\ell_1$-minimization approach has the best practical performance, the computational cost of solving sparse recovery using the interior point method is typically $O(N^{1.5}M^2)$. Moreover, since the constraint function $\|\Phi\alpha - f\|_1$ is not even differentiable, most gradient-based optimization methods are not directly applicable. In the next section we propose an alternative efficient algorithm that approximately solves the objective of the geometric approach.


9.2 Sparse Approximation in $\ell_1$ Norm

In this section, we propose the $\ell_1$-GAME algorithm for solving the problem of sparse approximation in the $\ell_1$ norm. The $\ell_1$-GAME algorithm is a special instance of the GAME algorithm introduced in Section 7.2. Later on, in Section 9.3, we will use the $\ell_1$-GAME algorithm to approximately solve the non-smooth $\ell_1$-minimization of Theorem 9.2.

Let $\Phi$ be any $M \times N$ matrix with $M \le N$, let $\alpha^*$ be a sparse vector in $A(k, \tau)$, defined by Equation (6.1.2), and let $e_M$ be any vector in $\mathbb{R}^M$. Let $f = \Phi\alpha^* + e_M$ denote the measurement vector. Sparse approximation in the $\ell_1$ norm refers to the following problem:

\[ \underset{\alpha \in A(k, \tau)}{\text{minimize}}\ \|\Phi\alpha - f\|_1. \tag{9.2.1} \]
(a) $M = 1000$ measurements are sufficient for successful recovery of 100-sparse signals using the Basis Pursuit algorithm.

(b) $M = 2200$ measurements are necessary for successful recovery of 100-sparse signals using the SSMP algorithm.

Figure 9.1: Comparisons of exact recovery experiments for SSMP and BP algorithms as provided in [30]. All plots are for the same signal length $N = 20000$ and left degree $d = 8$. Each experiment is repeated independently 100 times.


Algorithm 6 The $\ell_1$-GAME algorithm

Inputs: $f$, $\Phi$, and parameters $T$, $\tau$ and $\eta > 0$.

Output: A $T$-sparse vector $\hat{\alpha}$.

1: Set $P^1 = [0]^M$.
2: for $t = 1, \dots, T$ do
3: Let $r^t = \Phi^\top P^t$.
4: Find the index $i$ of one largest (in magnitude) element of $r^t$.
5: Let $\alpha^t$ be a 1-sparse vector with $\mathrm{Supp}(\alpha^t) = \{i\}$ and $\alpha^t_i = -\tau\, \mathrm{Sign}(r^t_i)$.
6: Update $Q^{t+1} = P^t + \frac{\eta}{2}\left(\Phi\alpha^t - f\right)$.
7: For each $j \in [M]$, let $P^{t+1}_j = S(Q^{t+1}_j, 1)$. ($S(Q^{t+1}_j, 1)$ uses the soft-thresholding operator of Definition 2.3 with $\theta = 1$.)
8: end for
9: Output $\hat{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha^t$.
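A compact NumPy sketch of these steps follows. It is an illustration under assumptions rather than the author's code: the function name `l1_game` is hypothetical, and step 7 is implemented as coordinate-wise clipping to $[-1, 1]$, which is the Euclidean projection onto the box and is assumed here to be what $S(\cdot, 1)$ denotes (Definition 2.3 is not reproduced in this section).

```python
import numpy as np

def l1_game(Phi, f, tau, eta, T):
    """Sketch of Algorithm 6; returns the average of T one-sparse plays."""
    M, N = Phi.shape
    P = np.zeros(M)                                   # step 1
    alpha_sum = np.zeros(N)
    for _ in range(T):
        r = Phi.T @ P                                 # step 3
        i = int(np.argmax(np.abs(r)))                 # step 4
        alpha_t = np.zeros(N)
        alpha_t[i] = -tau * np.sign(r[i])             # step 5 (zero if r[i] == 0)
        Q = P + 0.5 * eta * (Phi @ alpha_t - f)       # step 6
        P = np.clip(Q, -1.0, 1.0)                     # step 7, assumed projection
        alpha_sum += alpha_t
    return alpha_sum / T                              # step 9
```

Because every play is 1-sparse with $\ell_1$ norm at most $\tau$, the returned average has at most $T$ non-zero entries and $\ell_1$ norm at most $\tau$, independent of the data.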


Unfortunately, since $A(k, \tau)$ is not convex, solving the problem of Equation (9.2.1) is intractable. However, this optimization problem is equivalent to sparse approximation in the $\ell_q$ norm with $q = 1$. Therefore the results of Section 6.1 can be applied to approximately solve Equation (9.2.1) efficiently.

Remark 9.3. Note that throughout this section we assume that an upper bound $\tau$ on the $\ell_1$ norm of $\alpha^*$ is known a priori. While this assumption is directly valid in many applications, we will still provide a way to efficiently compute an estimate in the expander-based compressed sensing problem in Section 9.3.

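Following Equation (6.1.6), let

\[ S_\infty = [-1, 1]^M = \{P \in \mathbb{R}^M : \|P\|_\infty \le 1\}, \]

and let $A(\tau)$ be as in Equation (6.1.1). Define the loss function $\mathcal{L} : S_\infty \times A(\tau) \to \mathbb{R}$ as

\[ \mathcal{L}(P, \alpha) = \langle P, (\Phi\alpha - f) \rangle. \]

It follows from Equation (6.1.10) with $q = 1$ and $p = \infty$ that the problem of sparse approximation in the $\ell_1$ norm can be viewed as a min-max game:

\[ \min_{\alpha \in A(k, \tau)} \|\Phi\alpha - f\|_1 = \min_{\alpha \in A(k, \tau)} \max_{P \in S_\infty} \mathcal{L}(P, \alpha). \tag{9.2.2} \]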
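The inner maximization in this game is the standard duality between the $\ell_1$ and $\ell_\infty$ norms: for a fixed residual $v = \Phi\alpha - f$, the value $\max_{\|P\|_\infty \le 1} \langle P, v \rangle$ equals $\|v\|_1$ and is attained at $P = \mathrm{Sign}(v)$. A small numerical check, with a random vector standing in for the residual (an illustrative choice, not data from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(8)            # stands in for the residual Phi @ alpha - f

P_star = np.sign(v)                   # a maximizer over the box [-1, 1]^M
best = P_star @ v                     # equals the l1 norm of v

# Randomly drawn feasible P never exceed the value attained at sign(v).
samples = rng.uniform(-1.0, 1.0, size=(1000, v.size))
others = samples @ v

print(best, np.abs(v).sum(), others.max())
```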

Since sparse approximation in the $\ell_1$ norm is a special case of sparse approximation in the $\ell_q$ norm, the GAME algorithm (Algorithm 3) can be used to approximate the optimal solution of Equation (9.2.1). In order to obtain guarantees on the performance of the GAME algorithm, Theorem 9.4 requires that the Bregman function is properly chosen so that

\[ \forall P, Q \in S_\infty, \quad B_{\mathcal{R}}(P, Q) \ge \|P - Q\|_\infty^2. \]

It is easy to verify that this requirement is satisfied if the squared Euclidean norm $\mathcal{R}(P) = \|P\|_2^2$, with $B_{\mathcal{R}}(P, Q) = \|P - Q\|_2^2$, is used in the GAME algorithm. The pseudocode of the $\ell_1$-GAME algorithm is provided in Algorithm 6, and describes a special GAME algorithm which exploits the choice of Euclidean distance as the Bregman function.

Other  choices  for  the  Bregman  function  may  lead  to  different  convergence  bounds  and 
different  running  times  for  the  new  projections  and  updates.  For  instance,  a  multi¬ 
plicative  update  version  of  the  algorithm  (MU-GAME)  can  be  derived  by  using  the 
Bregman  divergence  based  on  the  relative  entropy  function.  Surprisingly,  the  derived 
guarantees  for  GAME  can  be  shown  to  also  hold  for  MU-GAME  in  a  straightforward 
manner. 

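The general $\ell_1$-GAME algorithm starts by finding a $P^1$ such that $\nabla\mathcal{R}(P^1) = 2P^1 = 0_M$. Then at every iteration, in step 6, the algorithm finds a $Q^{t+1}$ with

\[ \nabla\left(B_{\mathcal{R}}(Q^{t+1}, P^t) - \eta\,\langle Q^{t+1}, \Phi\alpha^t - f \rangle\right) = 2(Q^{t+1} - P^t) - \eta(\Phi\alpha^t - f) = 0_M, \]

and then updates $P^{t+1}$ via the Bregman projection

\[ P^{t+1} = \arg\min_{P \in [-1, 1]^M} B_{\mathcal{R}}(P, Q^{t+1}) = S(Q^{t+1}, 1). \]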


The  theorem  below  is  based  on  Theorem  9.4  in  Section  7.3,  and  shows  that  for  every 
positive  e,  as  long  as  T  =  O  ^  l j ,  the  GAME  algorithm  after  T  iterations 

Ends  a  T-sparse  vector  at  with  multiplicative  approximation  error  in  the 

measurement  domain. 


Theorem 9.4. Let $\Phi$ be an $M \times N$ matrix, and let $\|\Phi\|_1$ denote the $\ell_1$ norm of the matrix, which is defined in Equation (2.2.1). Suppose $\alpha^*$ is a vector in $A(k, \tau)$, and let $f = \Phi\alpha^* + e_M$, where $e_M$ is the measurement noise vector. Let $\varepsilon$ be any number in $(0, 1]$, and let $\hat{\alpha}$ denote the output of the $\ell_1$-GAME algorithm after


T  =  max  <  k,  M 


iterations  with  regularization  parameter 


leiw||i  +  2||$||ir 
2£||eAf  ||i 


V  = 


||  ejvr  ||  ] 


' 


leM||i  +  2||4>||ir) 

Then, $\hat{\alpha}$ is a vector in $A(T, \tau)$ with

\[ \|\Phi\hat{\alpha} - f\|_1 \le (1 + \varepsilon)\|e_M\|_1. \]


(9.2.3) 


(9.2.4) 




Proof. At every iteration $t$, $\alpha^t$ is a 1-sparse solution of the minimization problem $\text{minimize}\ \mathcal{L}(P^t, \alpha)$. Moreover, for every $\alpha$ in $A(1, \tau)$ we have

\[ \|\Phi\alpha - f\|_2 \le \|\Phi\alpha - f\|_1 \le \|e_M\|_1 + \|\Phi\|_1\|\alpha - \alpha^*\|_1 \le \|e_M\|_1 + 2\|\Phi\|_1\tau. \]

Moreover, for every $P \in S_\infty$ we have

\[ B_{\mathcal{R}}(P, P^1) = \|P\|_2^2 = \sum_{j=1}^{M} |P_j|^2 \le M. \]

As a result, by setting $G = \|e_M\|_1 + 2\|\Phi\|_1\tau$ and $D = \sqrt{M}$, from Theorem 9.4 it follows that $\hat{\alpha}$ is a vector in $A(T, \tau)$ with

\[ \|\Phi\hat{\alpha} - f\|_1 \le \min_{\alpha \in A(k, \tau)} \|\Phi\alpha - f\|_1 + \frac{GD}{2\sqrt{T}} \le \|e_M\|_1 + \varepsilon\|e_M\|_1 = (1 + \varepsilon)\|e_M\|_1. \]

□

Remark 9.5. In this section, we assumed that the vector $\alpha^*$ is exactly $k$-sparse. However, this assumption is without loss of generality. To see this, observe that every vector $\alpha^* \in \mathbb{R}^N$ satisfies

\[ f = \Phi\alpha^* + e_M = \Phi H_k(\alpha^*) + \left[\Phi(\alpha^* - H_k(\alpha^*)) + e_M\right], \]

and

\[ \|\Phi(\alpha^* - H_k(\alpha^*))\|_1 \le \|\Phi\|_1\|\alpha^* - H_k(\alpha^*)\|_1. \]

Therefore, one can always assume that the original vector is exactly $k$-sparse, by assuming that the measurement noise is $\Phi(\alpha^* - H_k(\alpha^*)) + e_M$, with

\[ \|\Phi(\alpha^* - H_k(\alpha^*)) + e_M\|_1 \le \|\Phi\|_1\|\alpha^* - H_k(\alpha^*)\|_1 + \|e_M\|_1. \]

In particular, if $\Phi$ is the adjacency matrix of an expander graph, then $\|\Phi\|_1 = d$ and

\[ \|\Phi(\alpha^* - H_k(\alpha^*)) + e_M\|_1 \le d\,\|\alpha^* - H_k(\alpha^*)\|_1 + \|e_M\|_1. \tag{9.2.5} \]

9.3  Expander-based  GAME  Algorithm 

In Section 9.2 we showed that if $\Phi$ is the adjacency matrix of an expander graph, then for every vector $\alpha^* \in \mathbb{R}^N$, given $f = \Phi\alpha^* + e_M$, the $\ell_1$-GAME algorithm can efficiently find a vector $\hat{\alpha}$ with bounded measurement-domain error

\[ \|\Phi\hat{\alpha} - f\|_1 = O\left(d\,\|\alpha^* - H_k(\alpha^*)\|_1 + \|e_M\|_1\right). \]

In this section, we combine the results of Section 9.2 and Theorem 9.2, and propose an efficient algorithm, called e-GAME, that finds an estimate $\hat{\alpha}$ with the $\ell_1/\ell_1$ guarantee

\[ \|\hat{\alpha} - \alpha^*\|_1 = O\left(\|\alpha^* - H_k(\alpha^*)\|_1 + \frac{\|e_M\|_1}{d}\right). \]

As in the previous section, throughout this section we assume without loss of generality that the vector $\alpha^*$ is exactly $k$-sparse, by adding the residual term $\Phi(\alpha^* - H_k(\alpha^*))$ to the measurement noise.

The pseudocode of the e-GAME algorithm is shown in Algorithm 7. The following lemma is key in establishing the guarantees of e-GAME.

Lemma 9.6. Let $\Phi$ be the adjacency matrix of a $(k, \epsilon, d)$ expander graph with $\|\Phi\|_1 = d$. Let $(\hat{\alpha}^1, \dots, \hat{\alpha}^\Theta)$ be the vectors generated by the e-GAME algorithm. Then at least one of the following two conditions holds. That is, either

(C1) there exists an index $t$ with $\|\hat{\alpha}^t\|_1 \le \|\alpha^*\|_1$ and $\|\Phi(\hat{\alpha}^t - \alpha^*)\|_1 \le (2 + \varepsilon)\|e_M\|_1$,

or

(C2) for every iteration $t$, $Lo^t \le \|\alpha^*\|_1 \le Up^t$.

Proof. We prove Lemma 9.6 by induction. First consider $t = 0$. Since $\alpha^*$ is $k$-sparse, it follows from the RIP-1 property of expander graphs (Lemma 8.8) and the triangle inequality that

\[ (1 - 2\epsilon)d\,\|\alpha^*\|_1 \le \|\Phi\alpha^*\|_1 = \|f - e_M\|_1 \le \|f\|_1 + \|e_M\|_1. \]

Assume that Condition (C2) holds for $t - 1$; we now show that it is also valid for index $t$ via two different cases:

Case 1: $\|\Phi\hat{\alpha}^t - f\|_1 > (1 + \varepsilon)\|e_M\|_1$. If $\|\alpha^*\|_1 \le \tau^t$ then

\[ \min_{\|\alpha\|_1 \le \tau^t} \|\Phi\alpha - f\|_1 \le \|\Phi\alpha^* - f\|_1 = \|e_M\|_1 < \frac{\|\Phi\hat{\alpha}^t - f\|_1}{1 + \varepsilon}, \]

which contradicts the $(1 + \varepsilon)$ approximation guarantee of the $\ell_1$-GAME algorithm. Therefore we must have $\|\alpha^*\|_1 > \tau^t = Lo^t$. It also follows from the induction hypothesis that $\|\alpha^*\|_1 \le Up^{t-1} = Up^t$.

Case 2: $\|\Phi\hat{\alpha}^t - f\|_1 \le (1 + \varepsilon)\|e_M\|_1$. In this case, if $\tau^t \le \|\alpha^*\|_1$, then we have $\|\hat{\alpha}^t\|_1 \le \|\alpha^*\|_1$ and

\[ \|\Phi(\hat{\alpha}^t - \alpha^*)\|_1 \le \|\Phi\hat{\alpha}^t - f\|_1 + \|\Phi\alpha^* - f\|_1 \le (2 + \varepsilon)\|e_M\|_1, \]

which is Condition (C1). Therefore, if (C1) is not valid then we must have $Up^t = \tau^t > \|\alpha^*\|_1$. Also, again from the induction hypothesis, we get $\|\alpha^*\|_1 \ge Lo^{t-1} = Lo^t$. □

The following theorem proves that at least one estimate $\hat{\alpha}^t$ is sufficiently close to $\alpha^*$.
Algorithm 7 The e-GAME Algorithm

Inputs: $f$, $\Phi$, and parameters $\varepsilon$ and $\delta$.

Output: An approximation $\hat{\alpha}$ for the vector $\alpha^*$.

1: Set $Lo^0 = 0$, $Up^0 = \frac{\|f\|_1 + \|e_M\|_1}{(1 - 2\epsilon)d}$, and $\Theta = \log\left(\frac{1}{\delta}\right)$.
2: for $t = 1, \dots, \Theta$ do
3: Let $\hat{\alpha}^t$ be the solution of the $\ell_1$-GAME algorithm with $\tau^t = \frac{Lo^{t-1} + Up^{t-1}}{2}$ and $T^t$ of Equation (9.2.3).
4: if $\|\Phi\hat{\alpha}^t - f\|_1 \le (1 + \varepsilon)\|e_M\|_1$ then
5: Set $Up^t = \tau^t$, and $Lo^t = Lo^{t-1}$.
6: else
7: Set $Lo^t = \tau^t$, and $Up^t = Up^{t-1}$.
8: end if
9: end for
10: Output $\hat{\alpha} = \arg\min_{t \in [\Theta]} \|\Phi H_k(\hat{\alpha}^t) - f\|_1$.
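The outer loop is a bisection over the $\ell_1$ budget $\tau$. The sketch below isolates that loop under assumptions: the midpoint rule for $\tau^t$ is a reading of the garbled step 3, consistent with the halving of $Up^t - Lo^t$ used in the analysis, and `solve` and `residual_l1` are hypothetical callables standing in for the $\ell_1$-GAME solver and for $\|\Phi\alpha - f\|_1$.

```python
def bisect_tau(solve, residual_l1, noise_l1, upper, eps, rounds):
    """Bisection over the l1 budget tau (outer loop of the e-GAME sketch).

    solve(tau)       -> an estimate with l1 norm at most tau (hypothetical)
    residual_l1(est) -> ||Phi est - f||_1 for that estimate (hypothetical)
    """
    lo, up = 0.0, upper
    estimates = []
    for _ in range(rounds):
        tau = 0.5 * (lo + up)              # assumed midpoint rule
        est = solve(tau)
        estimates.append(est)
        if residual_l1(est) <= (1.0 + eps) * noise_l1:
            up = tau                       # budget was large enough
        else:
            lo = tau                       # budget was too small
    return lo, up, estimates
```

For instance, with a scalar toy problem in which the best estimate under budget $\tau$ is $\min(\tau, 3)$ and the residual is its distance to 3, `bisect_tau(lambda t: min(t, 3.0), lambda a: abs(a - 3.0), 0.1, 8.0, 1.0, 10)` narrows the bracket around the smallest budget that meets the residual test.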


Theorem 9.7. Let $\Phi$ be the adjacency matrix of a $(2k, \epsilon, d)$ expander graph. Let $\varepsilon$ and $\delta$ be any two positive numbers, and let $(\hat{\alpha}^1, \dots, \hat{\alpha}^\Theta)$ be the vectors generated by the e-GAME algorithm. Then there exists an index $t$ with

\[ \|\alpha^* - \hat{\alpha}^t\|_1 \le \frac{(1 - 2\epsilon)\,\delta\,\|\alpha^*\|_1}{1 - 6\epsilon} + \frac{2(2 + \varepsilon)\|e_M\|_1}{(1 - 6\epsilon)d}. \]


Proof. The proof of Theorem 9.7 relies on Lemma 9.6. If Condition (C1) is satisfied for some index $t$, then Theorem 9.1 implies that

\[ \|\alpha^* - \hat{\alpha}^t\|_1 \le \frac{2(2 + \varepsilon)\|e_M\|_1}{(1 - 6\epsilon)d}, \]

whereas  if  Condition  (C2)  is  satisfied,  then  at  every  iteration  t  we  have 


|| d* || x  —  ||«*||i  <  Lip*  —  ||a*||i  <  Lip*  —  Lo*  < 
In  this  case,  if  O  =  log2  (|)  ,  then  Lemma  8.8  implies  that 


Up0  -  Lo° 


-0M  Up0  —  Lo°  (11/11,  +  MOA 


IAwlll-Ml< 


2e  h/(l  -  2e) 

d  ~  d 


<  ||a*||i<J. 


(9.3.1) 


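Furthermore, since both $\|\alpha^*\|_1$ and $\|\hat{\alpha}^\Theta\|_1$ are smaller than $\tau^\Theta$, Theorem 9.4 guarantees that

\[ \|\Phi\hat{\alpha}^\Theta - f\|_1 \le (1 + \varepsilon) \min_{\|\alpha\|_1 \le \tau^\Theta} \|\Phi\alpha - f\|_1 \le (1 + \varepsilon)\|e_M\|_1. \]

Therefore, from Theorem 9.1 we get

\[ \|\alpha^* - \hat{\alpha}^\Theta\|_1 \le \frac{(1 - 2\epsilon)\,\delta\,\|\alpha^*\|_1}{1 - 6\epsilon} + \frac{2(2 + \varepsilon)\|e_M\|_1}{(1 - 6\epsilon)d}. \]

□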

So far, we have shown that at least one of the estimates $(\hat{\alpha}^1, \dots, \hat{\alpha}^\Theta)$ is sufficiently close to $\alpha^*$. However, since $\alpha^*$ is not known a priori, we cannot directly determine which $\hat{\alpha}^t$ is close enough to $\alpha^*$. Fortunately, the RIP-1 property of expander graphs allows us to use the measurement-domain accuracy as a proxy for the data-domain accuracy of the estimates. More precisely, we show that the estimate

\[ \hat{\alpha} = \arg\min_{t \in [\Theta]} \|\Phi H_k(\hat{\alpha}^t) - f\|_1 \]

is sufficiently close to $\alpha^*$.


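Theorem 9.8. Let $\Phi$ be the adjacency matrix of a $(2k, \epsilon, d)$ expander graph. Let $\varepsilon$ and $\delta$ be any two positive numbers, and let $\hat{\alpha}$ be the output of the e-GAME algorithm. Then

\[ \|\alpha^* - H_k(\hat{\alpha})\|_1 \le \frac{2\delta\|\alpha^*\|_1}{1 - 6\epsilon} + \frac{1}{1 - 2\epsilon}\left(2 + \frac{4(2 + \varepsilon)}{1 - 6\epsilon}\right)\frac{\|e_M\|_1}{d}. \]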


Proof. Let $b$ be any vector in $\mathbb{R}^N$. Since $\alpha^*$ is $k$-sparse, from the triangle inequality and the definition of the best $k$-term approximation we have

\[ \|\alpha^* - H_k(b)\|_1 \le \|\alpha^* - b\|_1 + \|b - H_k(b)\|_1 \le 2\|\alpha^* - b\|_1. \]
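The second inequality holds because $\alpha^*$ is itself a $k$-sparse candidate, so the best $k$-term approximation $H_k(b)$ can be no farther from $b$ than $\alpha^*$ is. A quick numerical sanity check on random instances (an illustration with arbitrary sizes; `top_k` is a helper name introduced here for $H_k$):

```python
import numpy as np

def top_k(x, k):
    """H_k: keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
N, k = 50, 5
worst = 0.0
for _ in range(1000):
    a = top_k(rng.standard_normal(N), k)     # a k-sparse vector in the role of alpha*
    b = rng.standard_normal(N)               # an arbitrary vector b
    lhs = np.abs(a - top_k(b, k)).sum()      # ||alpha* - H_k(b)||_1
    rhs = 2.0 * np.abs(a - b).sum()          # 2 ||alpha* - b||_1
    worst = max(worst, lhs / rhs)

print("largest observed lhs / rhs:", worst)
```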

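Now, observe that for every $t$, $\alpha^* - H_k(\hat{\alpha}^t)$ is always $2k$-sparse. As a result, Lemma 8.8 yields that at every iteration $t$

\[ \|f - \Phi H_k(\hat{\alpha}^t)\|_1 \le \|e_M\|_1 + \|\Phi(\alpha^* - H_k(\hat{\alpha}^t))\|_1 \le \|e_M\|_1 + d\,\|\alpha^* - H_k(\hat{\alpha}^t)\|_1 \le \|e_M\|_1 + 2d\,\|\alpha^* - \hat{\alpha}^t\|_1. \tag{9.3.2} \]

Therefore, it follows from Theorem 9.7 that

\[ \|f - \Phi H_k(\hat{\alpha})\|_1 \le \frac{2d(1 - 2\epsilon)\,\delta\,\|\alpha^*\|_1}{1 - 6\epsilon} + \left(1 + \frac{4(2 + \varepsilon)}{1 - 6\epsilon}\right)\|e_M\|_1. \tag{9.3.3} \]

Moreover, since $\alpha^* - H_k(\hat{\alpha})$ is $2k$-sparse, the RIP-1 property implies that

\[ \|\alpha^* - H_k(\hat{\alpha})\|_1 \le \frac{\|\Phi(\alpha^* - H_k(\hat{\alpha}))\|_1}{(1 - 2\epsilon)d} \le \frac{\|f - \Phi H_k(\hat{\alpha})\|_1 + \|e_M\|_1}{(1 - 2\epsilon)d}, \]

which completes the proof. □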


The  following  corollary  is  a  direct  consequence  of  Theorem  9.8. 


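Corollary 9.9. Let $\Phi$ be the adjacency matrix of a $(2k, \epsilon, d)$ expander graph, where $\epsilon$ is a constant less than $\frac{1}{6}$. Let $\alpha^*$ be any vector in $\mathbb{R}^N$, and let $e_M$ be any noise vector in $\mathbb{R}^M$. Let

\[ \mathrm{SNR}_1 = \frac{\|H_k(\alpha^*)\|_1}{\|\alpha^* - H_k(\alpha^*)\|_1 + \frac{\|e_M\|_1}{d}}. \tag{9.3.4} \]

Then the e-GAME algorithm with $\delta = \frac{1}{\mathrm{SNR}_1}$ and $\varepsilon = 1$ recovers a vector $\hat{\alpha}$ with

\[ \|\alpha^* - \hat{\alpha}\|_1 = O\left(\|\alpha^* - H_k(\alpha^*)\|_1 + \frac{\|e_M\|_1}{d}\right). \]

Moreover, the overall recovery time is $O\left(MNd\,\mathrm{SNR}_1^2 \log \mathrm{SNR}_1\right)$.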


Proof. By treating $\Phi(\alpha^* - H_k(\alpha^*)) + e_M$ as the measurement-domain noise, we can always assume without loss of generality that $\alpha^*$ is exactly $k$-sparse. The data-domain sparse approximation bound then follows from Theorem 9.8 by setting $\delta = \frac{1}{\mathrm{SNR}_1}$ and $\varepsilon = 1$.

To calculate the overall running time of the algorithm, note that the e-GAME algorithm requires $\Theta = O(\log \mathrm{SNR}_1)$ iterations. At each such iteration, the $\ell_1$-GAME algorithm requires $T = O(M\,\mathrm{SNR}_1^2)$ iterations (Equation (9.2.3)), in which the bottleneck is one matrix-vector multiplication (i.e., calculating $\Phi^\top P^t$). This multiplication can be calculated efficiently using $O(Nd)$ operations as the graph is $d$-regular. □

Remark 9.10. An alternative approach is to use Nesterov's smoothing method for approximately solving non-smooth objective functions [204, 203]. We omit the details of this implementation. With the Nesterov method, we still need $O(\log \mathrm{SNR}_1)$ outer iterations, while the number of inner iterations can be reduced to $T = O(M\,\mathrm{SNR}_1)$. However, each inner iteration of the Nesterov method requires solving three smooth convex optimization problems, and is much more complicated than calculating one matrix-vector multiplication.


Tables 9.1 and 9.2 compare different properties of various sparse recovery algorithms which use sparse matrices constructed from the adjacency matrices of bipartite graphs. The algorithms are categorized as either combinatorial, which exploit the combinatorial properties (e.g. the unique neighborhood property) of the graph, or geometric, which exploit the geometric properties (e.g. the RIP-1 property) of that graph. The geometric algorithms are often capable of recovering signals with higher sparsity level, whereas the combinatorial algorithms are computationally more efficient.

The for-all sparse recovery model corresponds to recovery algorithms that (surely or with high probability) provide a close sparse approximation to every vector $\alpha^* \in \mathbb{R}^N$, whereas the for-each model focuses on recovery algorithms that can recover a sparse approximation to each vector $\alpha^*$ with high probability. The almost sparse model assumes that the vector has $k$ significant entries and the remaining entries are sufficiently close to zero. The positive model corresponds to recovery of sparse vectors with non-negative entries.


9.4  Experimental  Results 

In  this  section,  we  provide  experimental  results  to  empirically  investigate  the  fidelity 
of  the  algorithms  proposed  in  this  chapter.  Throughout  these  experiments  we  used 
N  =  1000,  M  =  200  and  k  =  20  for  illustration  to  demonstrate  the  typical  behavior 
of  the  algorithms  for  other  N,  M  and  k.  We  first  generated  a  200  x  1000  random 
expander  matrix  $,  and  then  repeated  the  following  experiment  100  times.  We 
generated  a  sparse  vector  with  random  support,  random  sign,  and  unit  magnitudes, 
generated  compressive  measurements,  and  then  recovered  a  sparse  estimate  for  the 
original  signal. 
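The setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is drawn as a random left-regular bipartite graph with left-degree d = 8 (the value quoted in the caption of Figure 9.2), which is an expander only with high probability, and the helper names `random_left_regular_graph`, `measure`, and `sparse_signal` are hypothetical rather than taken from the experiments' actual code.

```python
import random

def random_left_regular_graph(n_left, n_right, degree, rng):
    """For every signal coordinate, pick `degree` distinct measurement nodes."""
    return [rng.sample(range(n_right), degree) for _ in range(n_left)]

def measure(neighbors, signal, n_right):
    """Apply the 0/1 adjacency matrix: each measurement sums its left neighbors."""
    y = [0.0] * n_right
    for i, value in enumerate(signal):
        if value != 0.0:
            for j in neighbors[i]:
                y[j] += value
    return y

def sparse_signal(n, k, rng):
    """k-sparse vector with random support, random signs, and unit magnitudes."""
    x = [0.0] * n
    for i in rng.sample(range(n), k):
        x[i] = rng.choice((-1.0, 1.0))
    return x

rng = random.Random(0)          # fixed seed, chosen only for reproducibility
N, M, k, d = 1000, 200, 20, 8   # d = 8 is an assumption taken from Figure 9.2
graph = random_left_regular_graph(N, M, d, rng)
x_true = sparse_signal(N, k, rng)
y = measure(graph, x_true, M)   # noiseless compressive measurements
```

A recovery routine would then be run on `y` and the experiment repeated with fresh signals; the recovery step itself is deliberately left out of this sketch.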

Figure 9.2(a) plots the measurement-domain error of the $\ell_1$-GAME algorithm as a
function of the number of iterations $T$. Here we let the algorithm continue for 100,000
iterations. The figure shows that with $\ell_1$-GAME, the measurement-domain error
consistently decreases. Moreover, after some initial burn-in, the rate of convergence
is approximately $1/T$ (as opposed to the slower rate which was expected from
theory). The $1/T$ rate of convergence matches the best known first-order optimization
results [203].

Figure 9.2(b) compares the performance of the $\ell_1$-GAME algorithm with the Basis
Pursuit algorithm and with the SSMP algorithm [30] in terms of their stability against
measurement noise. As above, we set N = 1000, M = 200 and k = 20, and repeated
each experiment independently 100 times. The signal is generated by the same process
as above, and we used white Gaussian measurement noise with standard deviation
ranging from $10^{-5}$ to $10^{-2}$.

In this experiment we compared the average reconstruction error of the three
algorithms above as a function of the noise level. We used the CVX package [134, 133],
which is a standard convex optimization package, to solve the Basis Pursuit
program of Theorem 9.2 directly. We also used the SSMP code provided by Berinde and
Indyk [30] with 40 outer iterations, 100 inner iterations, and threshold 25 to solve the
SSMP optimization.¹

It is interesting that the approximation error of the $\ell_1$-GAME algorithm is very close
to the approximation error of the Basis Pursuit algorithm, and significantly lower
than the error of the SSMP algorithm. This experiment, and many other similar
experiments, confirm the advantages of the geometric reconstruction algorithms over

¹ We observed that if the SSMP algorithm is provided with exactly the sparsity level k = 20, then
the algorithm performs much better; however, the performance of the SSMP algorithm
degrades significantly even if the threshold is only slightly higher than the true sparsity level.


Table 9.1: Summary of sparse recovery algorithms that use geometric properties
of matrices constructed from sparse bipartite graphs. All bounds ignore the O(·)
constants. β is a constant, possibly different in each row, and SNR_1 is defined by
Equation (9.3.4). The rows of the table are sorted first by the signal model, then by
the number of measurements, and finally by recovery time in a decreasing order.

Approach                                Number of       Decoding                   Signal         Noise
                                        measurements    time                       model          tolerance
--------------------------------------  --------------  -------------------------  -------------  ---------
Expander-codes, Alg. 4 [159, 258, 259]  k log(N/k)      N log(N/k)                 Almost sparse  No
Minimal Expansion [169]                 k log(N/k)      N log(N/k)                 Positive       Yes
Count-Min [78, 79, 72]                  k log N         N log N                    For each       No
                                        k log^β N       k log^β N
LDPC-codes [125]                        k log(N/k)      k log^β N                  For each       Yes
SSMP [153, 31, 30]                      k log(N/k)      N log^2(N/k) log SNR_1     For all        Yes

Table 9.2: Summary of sparse recovery algorithms that use combinatorial properties
of matrices constructed from sparse bipartite graphs. All bounds ignore the O(·)
constants. β is a constant, possibly different in each row, and SNR_1 is defined by
Equation (9.3.4). All algorithms provide guarantees in the "for all" signal model. The
rows of the table are sorted first by the number of measurements, and then by recovery
time in a decreasing order.

Approach                          Number of       Decoding                                 Noise
                                  measurements    time                                     tolerance
--------------------------------  --------------  ---------------------------------------  ---------
Expander-codes [137, 138]         k (log N)^β     N^1.5 k^2 (log N)^(2β)                   No
Basis Pursuit [29]                k log(N/k)      N^1.5 k^2 (log(N/k))^2                   Yes
ℓ1-GAME, Algorithm 7 [156]        k log(N/k)      kN (log(N/k))^2 SNR_1^2 log SNR_1        Yes
the combinatorial algorithms in terms of stable signal recovery in expander-based
compressed sensing.


(a) The dependency between the measurement-domain error $\|\Phi\alpha - f\|_1$ and the
number of iterations $T$ of the $\ell_1$-GAME algorithm. The empirical rate of convergence
is approximately $1/T$ (as opposed to the slower rate expected from theory).


(b) Approximate recovery experiments with the SSMP, Basis Pursuit, and $\ell_1$-GAME
algorithms for expander-based compressed sensing. The measurement noise standard
deviation ranges from $10^{-5}$ to $10^{-2}$, and the approximation error is measured as
$\|\alpha^* - \hat{\alpha}\|_1 / \|\alpha^*\|_1$.

Figure 9.2: Empirical performance of the $\ell_1$-GAME algorithm. Here we fixed N =
1000, M = 200 and k = 20, and repeated each experiment independently 100 times
using a random expander graph with left-degree d = 8.


Chapter  10 


Expander-based  Compressed 
Sensing  in  the  Presence  of  Poisson 
Noise 

10.1  Introduction 

In Chapter 4 we introduced different applications of compressed sensing, and showed
that the compressed sensing framework is particularly appealing whenever the
measurement is costly or constrained in some sense. For example, in the context of
photon-limited applications (such as low-light imaging), the photo-multiplier tubes
used within sensor arrays are physically large and expensive. Similarly, when
measuring network traffic flows, the high-speed memory used in packet counters is
cost-prohibitive [8]. These problems appear ripe for the application of CS.

However, photon-limited measurements [229] and arrivals/departures of packets at
a router [33] are commonly modeled with a Poisson probability distribution, posing
significant theoretical and practical challenges in the context of CS. One of the key
challenges is the fact that the measurement error variance scales with the true
intensity of each measurement, so that we cannot assume constant noise variance across
the collection of measurements. Furthermore, measurements, underlying true
intensities, and system models are all subject to certain physical constraints which play a
significant role in performance.

Recent works [215, 155, 63, 177] explore methods for CS reconstruction in the
presence of impulsive, sparse or exponential family noise, but do not account for the
physical constraints associated with a typical Poisson setup and do not contain the
related performance bounds emphasized in this chapter. In previous work [252, 210],
Willett and Raginsky showed that a Poisson noise model combined with conventional
dense CS sensing matrices (properly scaled) yielded performance bounds that were
somewhat sobering relative to bounds typically found in the literature. In particular,
they found that if the number of photons (or packets) available to sense were held
constant, and if the number of measurements $M$ was above some critical threshold,
then larger $M$ in general led to larger bounds on the error between the true and the
estimated signals. This can intuitively be understood as resulting from the fact that
dense CS measurements in the Poisson case cannot be zero-mean, and the DC offset
used to ensure physical feasibility adversely impacts the noise variance.

The approach considered in this chapter hinges on reconstructing a signal from
compressive measurements by optimizing a sparsity-regularized goodness-of-fit objective
function. In contrast to many CS approaches, however, we measure the fit of an
estimate to the data using the Poisson log-likelihood instead of a squared error term. This
chapter demonstrates that the bounds developed in previous work can be improved
for some sparsity models by considering alternatives to dense sensing matrices with
random entries. In particular, we show that sparse sensing matrices given by scaled
adjacency matrices of expander graphs have important theoretical characteristics that
are ideally suited to controlling the performance of Poisson CS.

Formally, suppose we have a signal $\alpha^* \in \mathbb{R}_+^N$ with known $\ell_1$ norm $\|\alpha^*\|_1$ (or a
known upper bound on $\|\alpha^*\|_1$). We aim to find a matrix $A \in \mathbb{R}_+^{M \times N}$ with $M$, the
number of measurements, as small as possible, so that $\alpha^*$ can be recovered efficiently
from the measured vector $f \in \mathbb{Z}_+^M$, which is related to $A\alpha^*$ through a Poisson
observation model. The restriction that elements of $A$ be nonnegative reflects the
physical limitations of many sensing systems of interest (e.g., packet routers and
counters or linear optical systems).

In Section 8.1 we introduced the adjacency matrices of expander graphs as an
alternative to dense random matrices within the compressed sensing framework, leading
to computationally efficient recovery algorithms. Subsequently, we saw that
variations of the standard recovery approaches such as basis pursuit (Theorem 9.2) and
matching pursuit (Corollary 8.19) are consistent with the expander sensing approach
and can recover the original sparse signal successfully. In the presence of Gaussian
or sparse noise, random dense sensing and expander sensing are known to provide
similar performance in terms of the number of measurements and recovery
computation time. Furthermore, expander sensing requires less storage whenever the signal is
sparse in the canonical basis.

The  approach  described  in  this  chapter  consists  of  the  following  key  elements: 

•  expander  sensing  matrices  and  the  RIP-1  associated  with  them; 

•  a  reconstruction  objective  function  which  explicitly  incorporates  the  Poisson 
likelihood; 

•  a  countable  collection  of  candidate  estimators;  and 
•  a penalty function defined over the collection of candidates, which satisfies the
Kraft inequality and which can be used to promote sparse reconstructions.

In general, the penalty function is selected to be small for signals of interest, which
leads to theoretical guarantees that errors are small with high probability for such
signals. In this chapter, exploiting the RIP-1 property and the non-negativity of the
expander-based sensing matrices, we show that, in contrast to random dense sensing,
expander sensing empowered with a maximum a posteriori (MAP) algorithm can
approximately recover the original signal in the presence of Poisson noise, and we
prove bounds which quantify the MAP performance. As a result, in the presence of
Poisson noise, expander graphs not only provide general storage and computational
advantages, but they also allow devising efficient MAP recovery methods with
performance guarantees comparable to the best $k$-term approximation of the original signal.
Finally, the bounds are tighter than those for the specific dense matrices proposed by Willett and
Raginsky [252, 210] whenever the signal is sparse in the canonical domain, in that a
log term in the bounds in [210] is absent from the bounds presented in this chapter.

10.1.1  Dense  sensing  matrices  for  Poisson  CS 

In  recent  work,  Willett  and  Raginsky  established  performance  bounds  for  CS  in  the 
presence  of  Poisson  noise  using  dense  sensing  matrices  based  on  appropriately  shifted 
and  scaled  Rademacher  ensembles  [252,  210].  Several  features  distinguish  that  work 
from  the  present  chapter: 

•  The  dense  sensing  matrices  used  in  [252,  210]  require  more  memory  to  store  and 
more  computational  resources  to  apply  to  a  signal  in  a  reconstruction  algorithm. 
As  explained  in  Table  8.1,  the  expander-based  approach,  in  contrast,  is  more 
efficient. 

•  The  expander-based  approach  described  in  this  chapter  works  only  when  the 
signal  of  interest  is  sparse  in  the  canonical  basis.  In  contrast,  the  dense  sensing 
matrices  used  in  [252,  210]  can  be  applied  to  arbitrary  sparsity  bases  (though 
the  proof  technique  there  needs  to  be  altered  slightly  to  accommodate  sparsity 
in  the  canonical  basis). 

•  The bounds in both this chapter and [252, 210] reflect a sobering tradeoff
between performance and the number of measurements collected. In particular,
more measurements (after some critical minimum number) can actually degrade
performance as a limited number of events (e.g., photons) are distributed among
a growing number of detectors, impairing the SNR of the measurements.
10.2  Compressed sensing in the presence of Poisson Noise


10.2.1  Problem  statement 


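We wish to recover an unknown vector $\alpha^* \in \mathbb{R}_+^N$ of Poisson intensities from a measured
vector $f \in \mathbb{Z}_+^M$, sensed according to the Poisson model

\[
f \sim \mathrm{Poisson}(A\alpha^*), \tag{10.2.1}
\]

where $A \in \mathbb{R}_+^{M \times N}$ is a positivity-preserving sensing matrix. That is, for each
$j \in \{1, \ldots, M\}$, $f_j$ is sampled independently from a Poisson distribution with mean
$(A\alpha^*)_j$:

\[
P_{A\alpha^*}(f) = \prod_{j=1}^{M} P_{(A\alpha^*)_j}(f_j), \tag{10.2.2}
\]

where, for any $z \in \mathbb{Z}_+$ and $\lambda \in \mathbb{R}_+$, we have

\[
P_\lambda(z) =
\begin{cases}
\dfrac{\lambda^z}{z!} e^{-\lambda} & \text{if } \lambda > 0, \\[1ex]
1_{\{z = 0\}} & \text{otherwise,}
\end{cases}
\tag{10.2.3}
\]

where the $\lambda = 0$ case is a consequence of the fact that

\[
\lim_{\lambda \to 0} \frac{\lambda^z}{z!} e^{-\lambda} = 1_{\{z = 0\}}.
\]

We assume that the $\ell_1$ norm of $\alpha^*$ is known, $\|\alpha^*\|_1 = L$ (although later we will show
that this assumption can be relaxed). We are interested in designing a sensing matrix
$A$ and an estimator $\hat{\alpha} = \hat{\alpha}(f)$, such that $\alpha^*$ can be recovered with small expected
$\ell_1$ risk

\[
R(\hat{\alpha}, \alpha^*) = \mathbb{E}_{A\alpha^*} \|\hat{\alpha} - \alpha^*\|_1,
\]

where the expectation is taken w.r.t. the distribution $P_{A\alpha^*}$.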
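As a concrete illustration of the observation model (10.2.1)-(10.2.3), the following Python sketch evaluates the Poisson probability mass function, including the degenerate zero-intensity case, and draws Poisson counts. The function names are hypothetical, and the inversion sampler is an assumption made for brevity that is only suitable for moderate intensities.

```python
import math
import random

def poisson_pmf(z, lam):
    """P_lam(z) as in (10.2.3), including the degenerate lam = 0 case."""
    if lam > 0:
        # Log-domain evaluation avoids overflow in lam**z and z!.
        return math.exp(z * math.log(lam) - lam - math.lgamma(z + 1))
    return 1.0 if z == 0 else 0.0

def log_likelihood(f, mean):
    """log P_mean(f) for independent coordinates, following (10.2.2)."""
    total = 0.0
    for fj, mj in zip(f, mean):
        if mj > 0:
            total += fj * math.log(mj) - mj - math.lgamma(fj + 1)
        elif fj != 0:
            return float("-inf")  # a zero intensity cannot produce a count
    return total

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by CDF inversion (moderate lam only)."""
    if lam <= 0:
        return 0
    if lam > 700:
        raise ValueError("inversion sampler underflows for very large lam")
    u, z = rng.random(), 0
    p = math.exp(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        z += 1
        p *= lam / z
        cdf += p
    return z

rng = random.Random(1)
intensities = [0.0, 2.5, 4.0]            # stands in for A @ alpha_star
f = [sample_poisson(m, rng) for m in intensities]
```

A coordinate with zero intensity always yields a zero count, which is exactly the convention encoded by the indicator in (10.2.3).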


10.2.2  The  proposed  estimator  and  its  performance 

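For future convenience, we introduce the following notation. Given $N$ and $1 \le k \le N/4$,
we denote by $G_{k,N}$ a $(2k, 1/16)$-expander with left set size $N$ whose existence is
guaranteed by Proposition 8.2. Then $G_{k,N} = (V, C, E)$ has

\[
|V| = N, \qquad |C| = M = O(k \log(N/k)), \qquad d = O(\log(N/k)).
\]

Moreover, since $G_{k,N}$ is regular, there exists a minimal set $\Omega \subset V$ of size at most
$M = |C|$, such that its neighborhood covers all of $C$, i.e. $\mathcal{N}(\Omega) = C$. Hence,
$\Phi \mathbf{1}_\Omega \succeq \mathbf{1}$, where $\Phi$ is the adjacency matrix of $G_{k,N}$ and $\mathbf{1}_\Omega$ is the indicator vector of $\Omega$.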
To recover $\alpha^*$, we will use a penalized Maximum Likelihood Estimation (pMLE)
approach. Let us choose a convenient $1 \le k \le N/4$ and take $A$ to be the normalized
adjacency matrix of the expander $G_{k,N}$: $A = \Phi/d$. Moreover, let us choose a finite
or countable set $\Theta_L$ of candidate estimators $\alpha \in \mathbb{R}_+^N$ with $\|\alpha\|_1 \le L$, and a penalty
$\mathrm{pen} : \Theta_L \to \mathbb{R}_+$, satisfying the Kraft inequality¹

\[
\sum_{\alpha \in \Theta_L} e^{-\mathrm{pen}(\alpha)} \le 1.
\]

For  instance,  we  can  impose  less  penalty  on  sparser  signals  or  construct  a  penalty 
based  on  any  other  prior  knowledge  about  the  underlying  signal. 

With these definitions, we consider the following penalized maximum likelihood
estimator (pMLE):

\[
\hat{\alpha} = \operatorname*{argmin}_{\alpha \in \Theta_L} \left[ -\log P_{A\alpha}(f) + 2\,\mathrm{pen}(\alpha) \right]. \tag{10.2.4}
\]

One way to think about the procedure in (10.2.4) is as a Maximum a posteriori
Probability (MAP) algorithm over the set of estimates $\Theta_L$, where the likelihood is computed
according to the Poisson model (10.2.3) and the penalty function corresponds to a
negative log prior on the candidate estimators in $\Theta_L$.
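To make the estimator (10.2.4) concrete, here is a minimal Python sketch of a pMLE search over a small finite candidate set. It is a toy illustration rather than a practical solver: the 3 x 3 matrix `A`, the four candidates, and the uniform penalty $\mathrm{pen}(\alpha) = \log|\Theta_L|$ (which meets the Kraft inequality with equality) are all assumptions chosen for readability.

```python
import math

def neg_log_likelihood(f, mean):
    """-log P_mean(f) for independent Poisson counts."""
    total = 0.0
    for fj, mj in zip(f, mean):
        if mj > 0:
            total -= fj * math.log(mj) - mj - math.lgamma(fj + 1)
        elif fj != 0:
            return math.inf  # zero intensity cannot explain a nonzero count
    return total

def apply_matrix(A, alpha):
    """Dense matrix-vector product A @ alpha on plain lists."""
    return [sum(a * x for a, x in zip(row, alpha)) for row in A]

def pmle(f, A, candidates, pen):
    """Candidate minimizing -log P_{A alpha}(f) + 2 pen(alpha), cf. (10.2.4)."""
    # Kraft inequality: the penalties must behave like code lengths.
    assert sum(math.exp(-pen(c)) for c in candidates) <= 1.0 + 1e-12
    return min(
        candidates,
        key=lambda c: neg_log_likelihood(f, apply_matrix(A, c)) + 2.0 * pen(c),
    )

# Toy normalized adjacency matrix: every column has d = 2 entries equal to 1/d.
A = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
candidates = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0), (5.0, 5.0, 0.0)]
uniform_pen = lambda c: math.log(len(candidates))
f = [5, 0, 5]  # observed counts
alpha_hat = pmle(f, A, candidates, uniform_pen)
```

With a uniform penalty the search reduces to maximum likelihood over the candidates; a sparsity-promoting choice would assign shorter code lengths (smaller penalties) to sparser candidates.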

Our  main  bound  on  the  performance  of  the  pMLE  is  as  follows: 

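Theorem 10.1. Let $A$ be the normalized adjacency matrix of $G_{k,N}$, let $\alpha^* \in \mathbb{R}_+^N$ be
the original signal compressively sampled in the presence of Poisson noise, and let $\hat{\alpha}$
be obtained through (10.2.4). Then

\[
R(\hat{\alpha}, \alpha^*) \le 4\sigma_k(\alpha^*)
+ 8 \sqrt{ L \min_{\alpha \in \Theta_L} \left[ \mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) + 2\,\mathrm{pen}(\alpha) \right] }, \tag{10.2.5}
\]

where

\[
\mathrm{KL}(P_g \| P_h) = \sum_{y \in \mathbb{Z}_+^M} P_g(y) \log \frac{P_g(y)}{P_h(y)}
\]

is the Kullback-Leibler divergence (relative entropy) between $P_g$ and $P_h$, and $\sigma_k(\alpha^*)$
is the error of the best $k$-term approximation to $\alpha^*$, defined in Equation (2.1.2).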

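Proof. Since $\hat{\alpha} \in \Theta_L$, we have $L = \|\alpha^*\|_1 \ge \|\hat{\alpha}\|_1$. Hence, using Theorem 9.1 with
$\Delta = 0$, we can write

\[
\|\alpha^* - \hat{\alpha}\|_1 \le 4\sigma_k(\alpha^*) + 4\|A(\alpha^* - \hat{\alpha})\|_1.
\]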

¹ Many penalization functions can be modified slightly (e.g. scaled appropriately) to satisfy the
Kraft inequality. All that is required is a finite collection of estimators (i.e. $\Theta_L$) and an associated
prefix code for each candidate estimate in $\Theta_L$. For instance, this would certainly be possible for a
total variation penalty, though the details are beyond the scope of this paper.
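Taking expectations, we obtain

\[
\begin{aligned}
R(\hat{\alpha}, \alpha^*) &\le 4\sigma_k(\alpha^*) + 4\,\mathbb{E}_{A\alpha^*} \|A(\alpha^* - \hat{\alpha})\|_1 \\
&\le 4\sigma_k(\alpha^*) + 4 \sqrt{ \mathbb{E}_{A\alpha^*} \|A(\alpha^* - \hat{\alpha})\|_1^2 },
\end{aligned}
\tag{10.2.6}
\]

where the second step uses Jensen's inequality. Using Lemmas 10.5 and 10.6 in
Section 10.2.4, we have

\[
\mathbb{E}_{A\alpha^*} \|A(\alpha^* - \hat{\alpha})\|_1^2 \le 4L \min_{\alpha \in \Theta_L} \left[ \mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) + 2\,\mathrm{pen}(\alpha) \right].
\]

Substituting this into (10.2.6), we obtain (10.2.5). □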


The bound of Theorem 10.1 is an oracle inequality: it states that the $\ell_1$ error of $\hat{\alpha}$
is (up to multiplicative constants) the sum of the $k$-term approximation error of $\alpha^*$
plus $\sqrt{L}$ times the square root of the minimum penalized relative entropy error over the
set of candidate estimators $\Theta_L$. The first term in (10.2.5) is smaller for sparser $\alpha^*$, and the second term
is smaller when there exists $\alpha \in \Theta_L$ which is simultaneously a good approximation
to $\alpha^*$ (in the sense that the distributions $P_{A\alpha^*}$ and $P_{A\alpha}$ are close) and has a low
penalty.


Remark 10.2. So far we have assumed that the $\ell_1$ norm of $\alpha^*$ is known. However,
if $\|\alpha^*\|_1$ is not known a priori, we can still estimate it with high accuracy using noisy
compressive measurements. Observe that, since each measurement $f_j$ is a Poisson
random variable with mean $(A\alpha^*)_j$, $\sum_j f_j$ is Poisson with mean $\|A\alpha^*\|_1$. Therefore,
$\sqrt{\sum_j f_j}$ is approximately normally distributed with mean $\approx \sqrt{\|A\alpha^*\|_1}$ and variance
$\approx 1/4$ [189, Sec. 6.2].² Hence, Mill's inequality [249, Thm. 4.7] guarantees that, for
every positive $t$,

\[
\Pr\left( \left| \sqrt{\textstyle\sum_j f_j} - \sqrt{\|A\alpha^*\|_1} \right| > t \right) \lesssim \frac{e^{-2t^2}}{\sqrt{2\pi}\, t},
\]

where $\lesssim$ is meant to indicate the fact that this is only an approximate bound, with the
approximation error controlled by the rate of convergence in the central limit theorem.
Now we can use the RIP-1 property of the expander graphs to obtain the estimates

\[
\left( \sqrt{\textstyle\sum_j f_j} - t \right)^2 \le \|A\alpha^*\|_1 \le \|\alpha^*\|_1,
\]

and

\[
\frac{\left( \sqrt{\sum_j f_j} + t \right)^2}{1 - 2\epsilon} \ge \frac{\|A\alpha^*\|_1}{1 - 2\epsilon} \ge \|\alpha^*\|_1,
\]

that hold with probability (approximately) at least $1 - (\sqrt{2\pi}\, t)^{-1} e^{-2t^2}$.


2This  observation  underlies  the  use  of  variance-stabilizing  transforms. 
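The estimate in Remark 10.2 amounts to a short computation on the total photon (or packet) count. The Python sketch below is a hedged illustration: the function name is hypothetical, `eps` stands for the expansion parameter of the graph (1/16 for the expander used in this chapter), and the lower bound is clipped at zero when the total count is small, which is an added safeguard not discussed in the text.

```python
import math

def l1_norm_interval(counts, t, eps):
    """Approximate interval for ||alpha*||_1 from Poisson counts.

    Returns (lower, upper, failure), where `failure` is the approximate
    probability that the interval misses, as given by Mill's inequality.
    """
    root = math.sqrt(sum(counts))
    lower = max(root - t, 0.0) ** 2            # clipped: assumption of this sketch
    upper = (root + t) ** 2 / (1.0 - 2.0 * eps)
    failure = math.exp(-2.0 * t * t) / (math.sqrt(2.0 * math.pi) * t)
    return lower, upper, failure

# Example: 10,000 counted events in total, t = 2, expansion parameter 1/16.
lower, upper, failure = l1_norm_interval([2500, 2500, 5000], t=2.0, eps=1.0 / 16.0)
```

Because the interval is built on the square-root scale, its width grows only like the square root of the total count, which reflects the variance-stabilizing effect mentioned in the footnote.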
10.2.3  A bound in terms of $\ell_1$ error

The bound of Theorem 10.1 is not always useful since it bounds the $\ell_1$ risk of the
pMLE in terms of the relative entropy. A bound purely in terms of $\ell_1$ errors would be
more desirable. However, this is not easy to obtain without imposing extra conditions
either on $\alpha^*$ or on the candidate estimators in $\Theta_L$. This follows from the fact that
the divergence $\mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha})$ may take the value $+\infty$ if there exists some $f$ such
that $P_{A\alpha}(f) = 0$ but $P_{A\alpha^*}(f) > 0$.

One way to eliminate this problem is to impose an additional requirement on the
candidate estimators in $\Theta_L$: there exists some $c > 0$, such that

\[
A\alpha \succeq c\,\mathbf{1}, \qquad \forall \alpha \in \Theta_L, \tag{10.2.7}
\]

that is, $(A\alpha)_j \ge c$ for every $j$. Under this condition, we will now develop a risk
bound for the pMLE purely in terms of the $\ell_1$ error.

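Theorem 10.3. Suppose that all the conditions of Theorem 10.1 are satisfied. In
addition, suppose that the set $\Theta_L$ satisfies the condition (10.2.7). Then

\[
R(\hat{\alpha}, \alpha^*) \le 4\sigma_k(\alpha^*) + 8 \sqrt{ L \min_{\alpha \in \Theta_L} \left[ \frac{1}{c} \|\alpha^* - \alpha\|_1^2 + 2\,\mathrm{pen}(\alpha) \right] }. \tag{10.2.8}
\]

Proof. Using Lemma 10.7 in Section 10.2.4, we get the bound

\[
\mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) \le \frac{1}{c} \|\alpha^* - \alpha\|_1^2, \qquad \forall \alpha \in \Theta_L.
\]

Substituting this into Eq. (10.2.5), we get (10.2.8). □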

Remark 10.4. Because every $\alpha \in \Theta_L$ satisfies $\|\alpha\|_1 \le L$, the constant $c$ cannot be
too large. In particular, if (10.2.7) holds, then for every $\alpha \in \Theta_L$ we must have

\[
\|A\alpha\|_1 \ge M \min_j (A\alpha)_j \ge Mc.
\]

On the other hand, since $\|A\|_1 = 1$, we have $\|A\alpha\|_1 \le \|\alpha\|_1 \le L$. Thus, a necessary
condition for (10.2.7) to hold is $c \le L/M$. Since $M = O(k \log(N/k))$, the best risk
we may hope to achieve under some condition like (10.2.7) is on the order of

\[
R(\hat{\alpha}, \alpha^*) \le 4\sigma_k(\alpha^*)
+ C \sqrt{ \min_{\alpha \in \Theta_L} \left[ k \log(N/k)\, \|\alpha - \alpha^*\|_1^2 + L\,\mathrm{pen}(\alpha) \right] } \tag{10.2.9}
\]

for some constant $C$, e.g., by choosing $c \propto L / (k \log(N/k))$. Effectively, this means that,
under the positivity condition (10.2.7), the $\ell_1$ error of $\hat{\alpha}$ is the sum of the $k$-term
approximation error of $\alpha^*$ plus $\sqrt{M} = \sqrt{k \log(N/k)}$ times the best penalized $\ell_1$
approximation error. The first term in (10.2.9) is smaller for sparser $\alpha^*$, and the
second term is smaller when there is an $\alpha \in \Theta_L$ which is simultaneously a good $\ell_1$
approximation to $\alpha^*$ and has a low penalty.
10.2.4  Technical lemmas


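Lemma 10.5. Any $\alpha \in \Theta_L$ satisfies the bound

\[
\|A(\alpha^* - \alpha)\|_1^2 \le 4L \sum_{i=1}^{M} \left| (A\alpha^*)_i^{1/2} - (A\alpha)_i^{1/2} \right|^2.
\]

Proof. Since $A$ is the normalized adjacency matrix of the $d$-regular expander graph, $\|A\|_1 = 1$.
Therefore,

\[
\|A\alpha\|_1 \le \|\alpha\|_1 \le L, \qquad \forall \alpha \in \Theta_L. \tag{10.2.10}
\]

Let $\beta^* = A\alpha^*$ and $\beta = A\alpha$. Then

\[
\begin{aligned}
\|\beta^* - \beta\|_1^2
&= \left( \sum_{i=1}^{M} |\beta_i^* - \beta_i| \right)^2
 = \left( \sum_{i=1}^{M} \left| \beta_i^{*1/2} - \beta_i^{1/2} \right| \cdot \left| \beta_i^{*1/2} + \beta_i^{1/2} \right| \right)^2 \\
&\le \sum_{i=1}^{M} \left| \beta_i^{*1/2} - \beta_i^{1/2} \right|^2 \cdot \sum_{i=1}^{M} \left| \beta_i^{*1/2} + \beta_i^{1/2} \right|^2 \\
&\le 2 \sum_{i=1}^{M} \left| \beta_i^{*1/2} - \beta_i^{1/2} \right|^2 \cdot \sum_{i=1}^{M} |\beta_i^* + \beta_i| \\
&= 2 \sum_{i=1}^{M} \left| \beta_i^{*1/2} - \beta_i^{1/2} \right|^2 \cdot \left( \|\beta^*\|_1 + \|\beta\|_1 \right) \\
&\le 4L \sum_{i=1}^{M} \left| \beta_i^{*1/2} - \beta_i^{1/2} \right|^2
 = 4L \sum_{i=1}^{M} \left| (A\alpha^*)_i^{1/2} - (A\alpha)_i^{1/2} \right|^2.
\end{aligned}
\]

The first and the second inequalities are by Cauchy-Schwarz, while the third inequality
is a consequence of Eq. (10.2.10). □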
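As a sanity check on the chain of inequalities in Lemma 10.5, one can evaluate both sides numerically on the measurement-domain vectors $\beta^* = A\alpha^*$ and $\beta = A\alpha$. The Python sketch below is only a spot check under the assumption that both vectors are nonnegative with $\ell_1$ norm at most $L$; the function name is hypothetical.

```python
import math
import random

def lemma_10_5_sides(beta_star, beta, L):
    """Return (lhs, rhs) of Lemma 10.5 for nonnegative beta_star, beta."""
    lhs = sum(abs(a - b) for a, b in zip(beta_star, beta)) ** 2
    rhs = 4.0 * L * sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(beta_star, beta)
    )
    return lhs, rhs

def random_nonnegative(m, max_l1, rng):
    """Random nonnegative vector of length m with l1 norm at most max_l1."""
    raw = [rng.random() + 1e-12 for _ in range(m)]
    scale = rng.uniform(0.0, max_l1) / sum(raw)
    return [v * scale for v in raw]

rng = random.Random(0)
L = 10.0
worst_ratio = 0.0
for _ in range(1000):
    bs = random_nonnegative(6, L, rng)
    b = random_nonnegative(6, L, rng)
    lhs, rhs = lemma_10_5_sides(bs, b, L)
    if rhs > 0.0:
        worst_ratio = max(worst_ratio, lhs / rhs)
```

The quantity `worst_ratio` stays below one in such trials, consistent with the lemma; it is of course not a substitute for the proof.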

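Lemma 10.6. Let $\hat{\alpha}$ be a minimizer in Eq. (10.2.4). Then

\[
\mathbb{E}_{A\alpha^*} \left[ \sum_{i=1}^{M} \left| (A\alpha^*)_i^{1/2} - (A\hat{\alpha})_i^{1/2} \right|^2 \right]
\le \min_{\alpha \in \Theta_L} \left[ \mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) + 2\,\mathrm{pen}(\alpha) \right]. \tag{10.2.11}
\]

Proof. Using Lemma 10.8 below with $g = A\alpha^*$ and $h = A\hat{\alpha}$ we have

\[
\mathbb{E}_{A\alpha^*} \left[ \sum_{i=1}^{M} \left| (A\alpha^*)_i^{1/2} - (A\hat{\alpha})_i^{1/2} \right|^2 \right]
= \mathbb{E}_{A\alpha^*} \left[ 2 \log \frac{1}{\int \sqrt{P_{A\alpha^*}(f)\, P_{A\hat{\alpha}}(f)}\, d\nu(f)} \right].
\]

Clearly

\[
\int \sqrt{P_{A\alpha^*}(f)\, P_{A\hat{\alpha}}(f)}\, d\nu(f)
= \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right].
\]

We now provide a bound for this expectation. Let $\tilde{\alpha}$ be a minimizer of
$\mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) + 2\,\mathrm{pen}(\alpha)$ over $\alpha \in \Theta_L$. Then, by definition of $\hat{\alpha}$, we have

\[
\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})} \ge \sqrt{P_{A\tilde{\alpha}}(f)}\, e^{-\mathrm{pen}(\tilde{\alpha})}
\]

for every $f$. Consequently,

\[
\frac{1}{\mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}
\le
\frac{\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})}}
     {\sqrt{P_{A\tilde{\alpha}}(f)}\, e^{-\mathrm{pen}(\tilde{\alpha})}\, \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}.
\]

We can split the quantity

\[
2\, \mathbb{E}_{A\alpha^*} \left[ \log
\frac{\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})}}
     {\sqrt{P_{A\tilde{\alpha}}(f)}\, e^{-\mathrm{pen}(\tilde{\alpha})}\, \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}
\right]
\]

into three terms:

\[
\mathbb{E}_{A\alpha^*} \left[ \log \frac{P_{A\alpha^*}(f)}{P_{A\tilde{\alpha}}(f)} \right]
+ 2\,\mathrm{pen}(\tilde{\alpha})
+ 2\, \mathbb{E}_{A\alpha^*} \left[ \log
\frac{\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})}}
     {\sqrt{P_{A\alpha^*}(f)}\, \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}
\right].
\]

We show that the third term is always nonpositive, which completes the proof. Using
Jensen's inequality,

\[
\mathbb{E}_{A\alpha^*} \left[ \log
\frac{\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})}}
     {\sqrt{P_{A\alpha^*}(f)}\, \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}
\right]
\le
\log \mathbb{E}_{A\alpha^*} \left[
\frac{\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})}}
     {\sqrt{P_{A\alpha^*}(f)}\, \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}
\right].
\]

Now

\[
\mathbb{E}_{A\alpha^*} \left[
\frac{\sqrt{P_{A\hat{\alpha}}(f)}\, e^{-\mathrm{pen}(\hat{\alpha})}}
     {\sqrt{P_{A\alpha^*}(f)}\, \mathbb{E}_{A\alpha^*} \left[ \sqrt{\frac{P_{A\hat{\alpha}}(f)}{P_{A\alpha^*}(f)}} \right]}
\right]
\le \sum_{\alpha \in \Theta_L} e^{-\mathrm{pen}(\alpha)} \le 1.
\]

Since $\mathbb{E}_{A\alpha^*} \left[ \log \frac{P_{A\alpha^*}(f)}{P_{A\tilde{\alpha}}(f)} \right] = \mathrm{KL}(P_{A\alpha^*} \| P_{A\tilde{\alpha}})$, we obtain

\[
\begin{aligned}
\mathbb{E}_{A\alpha^*} \left[ \sum_{i=1}^{M} \left| (A\alpha^*)_i^{1/2} - (A\hat{\alpha})_i^{1/2} \right|^2 \right]
&\le \mathrm{KL}(P_{A\alpha^*} \| P_{A\tilde{\alpha}}) + 2\,\mathrm{pen}(\tilde{\alpha}) \\
&= \min_{\alpha \in \Theta_L} \left[ \mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) + 2\,\mathrm{pen}(\alpha) \right],
\end{aligned}
\]

which proves the lemma. □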


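Lemma 10.7. If the estimators in $\Theta_L$ satisfy the condition (10.2.7), then the
following inequality holds:

\[
\mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha}) \le \frac{1}{c} \|\alpha^* - \alpha\|_1^2, \qquad \forall \alpha \in \Theta_L.
\]

Proof. By definition of the KL divergence,

\[
\begin{aligned}
\mathrm{KL}(P_{A\alpha^*} \| P_{A\alpha})
&= \mathbb{E}_{A\alpha^*} \left[ \log \frac{P_{A\alpha^*}(f)}{P_{A\alpha}(f)} \right] \\
&= \sum_{j=1}^{M} \mathbb{E}_{(A\alpha^*)_j} \left[ f_j \log \frac{(A\alpha^*)_j}{(A\alpha)_j} - (A\alpha^*)_j + (A\alpha)_j \right] \\
&= \sum_{j=1}^{M} \left[ (A\alpha^*)_j \log \frac{(A\alpha^*)_j}{(A\alpha)_j} - (A\alpha^*)_j + (A\alpha)_j \right] \\
&\le \sum_{j=1}^{M} \left[ (A\alpha^*)_j \left( \frac{(A\alpha^*)_j}{(A\alpha)_j} - 1 \right) - (A\alpha^*)_j + (A\alpha)_j \right] \\
&= \sum_{j=1}^{M} \frac{1}{(A\alpha)_j} \left| (A\alpha^*)_j - (A\alpha)_j \right|^2
 \le \frac{1}{c} \|A(\alpha^* - \alpha)\|_2^2 \\
&\le \frac{1}{c} \|A(\alpha^* - \alpha)\|_1^2 \le \frac{1}{c} \|\alpha^* - \alpha\|_1^2.
\end{aligned}
\]

The first inequality uses $\log t \le t - 1$, the second is by (10.2.7), the third uses the fact
that the $\ell_1$ norm dominates the $\ell_2$ norm, and the last one is by the RIP-1 property
(Lemma 8.8). □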
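The measurement-domain part of Lemma 10.7 can be spot-checked numerically with the closed-form KL divergence between product Poisson laws. The following Python sketch is illustrative only: it works directly with intensity vectors $g = A\alpha^*$ and $h = A\alpha$ satisfying $h_j \ge c$, so it exercises the first three inequalities of the proof but not the final RIP-1 step, and the function name is hypothetical.

```python
import math
import random

def poisson_kl(g, h):
    """KL(P_g || P_h) for product Poisson laws with means g >= 0 and h > 0."""
    total = 0.0
    for gj, hj in zip(g, h):
        total += hj - gj
        if gj > 0:
            total += gj * math.log(gj / hj)  # 0 * log 0 is taken to be 0
    return total

rng = random.Random(0)
c = 0.5
violations = 0
for _ in range(1000):
    g = [rng.uniform(0.0, 5.0) for _ in range(8)]
    h = [rng.uniform(c, c + 5.0) for _ in range(8)]   # enforces h_j >= c
    kl = poisson_kl(g, h)
    l2_bound = sum((a - b) ** 2 for a, b in zip(g, h)) / c
    l1_bound = sum(abs(a - b) for a, b in zip(g, h)) ** 2 / c
    if not (-1e-12 <= kl <= l2_bound + 1e-9 <= l1_bound + 2e-9):
        violations += 1
```

No violations are expected, since each coordinate satisfies $g \log(g/h) - g + h \le (g - h)^2 / h$.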


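Lemma 10.8. Given two Poisson parameter vectors $g, h \in \mathbb{R}_+^M$, the following equality
holds:

\[
2 \log \frac{1}{\int \sqrt{P_g(f)\, P_h(f)}\, d\nu(f)} = \sum_{j=1}^{M} \left| g_j^{1/2} - h_j^{1/2} \right|^2,
\]

where $\nu$ denotes the counting measure on $\mathbb{Z}_+^M$.

Proof.

\[
\begin{aligned}
\int \sqrt{P_g(f)\, P_h(f)}\, d\nu(f)
&= \prod_{j=1}^{M} \sum_{f_j = 0}^{\infty} \frac{(g_j h_j)^{f_j/2}}{f_j!}\, e^{-(g_j + h_j)/2} \\
&= \prod_{j=1}^{M} e^{-\frac{1}{2} \left( g_j - 2 (g_j h_j)^{1/2} + h_j \right)}
 = \prod_{j=1}^{M} e^{-\frac{1}{2} \left( g_j^{1/2} - h_j^{1/2} \right)^2}.
\end{aligned}
\]

Taking logs, we obtain the lemma. □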
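The identity of Lemma 10.8 is easy to confirm numerically by summing $\sqrt{P_g P_h}$ coordinate by coordinate. The sketch below is a hedged check in Python: the truncation point `zmax` is an assumption that is adequate only for small intensities, and the function names are hypothetical.

```python
import math

def poisson_pmf(z, lam):
    """Poisson pmf with the convention P_0(z) = 1 if z == 0 else 0."""
    if lam > 0:
        return math.exp(z * math.log(lam) - lam - math.lgamma(z + 1))
    return 1.0 if z == 0 else 0.0

def hellinger_affinity(g, h, zmax=200):
    """Integral of sqrt(P_g P_h) w.r.t. counting measure, truncated at zmax."""
    prod = 1.0
    for gj, hj in zip(g, h):
        prod *= sum(
            math.sqrt(poisson_pmf(z, gj) * poisson_pmf(z, hj)) for z in range(zmax)
        )
    return prod

g = [1.0, 4.0, 0.0]
h = [2.5, 0.5, 3.0]
lhs = 2.0 * math.log(1.0 / hellinger_affinity(g, h))
rhs = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(g, h))
```

The two quantities agree to numerical precision, including on the coordinate where one of the intensities is zero.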


10.2.5  Empirical  performance 

Here we present a simulation study that corroborates our proposed method. In this
experiment, compressive Poisson observations are collected of a randomly generated
sparse signal passed through the sensing matrix generated using the proposed
expander graph method. We then reconstruct the signal by utilizing an algorithm that
minimizes the proposed objective function in (10.2.4), and assess the accuracy of
this estimate. We repeat this procedure over several trials to estimate the average
performance of the method.

More specifically, we generate our length-$N$ sparse signal $\alpha^*$ through a two-step
procedure. First we select $k$ elements uniformly at random, then we assign these
elements an intensity $I$. All other components of the signal are set to zero. For these
experiments, we chose a length $N$ = 100,000 and varied the sparsity $k$ among three
different choices of 100, 500, and 1,000 for two intensity levels $I$ of 10,000 and 100,000.
We then vary the number $M$ of Poisson observations from 100 to 20,000 using an
expander graph sensing matrix with degree $d = 8$. Recall that the sensing matrix is
normalized such that the total signal intensity is divided amongst the measurements,
hence the seemingly high choices of $I$.

To reconstruct the signal, we utilize the SPIRAL-$\ell_1$ algorithm [141] which solves
(10.2.4) when $\mathrm{pen}(\alpha) = \tau\|\alpha\|_1$. This algorithm utilizes a sequence of quadratic
subproblems derived by using a second-order Taylor expansion of the Poisson
log-likelihood at each iteration. These subproblems are made easier to solve by using a
separable approximation whereby the second-order Hessian matrix is approximated by
a scaled identity matrix. For the particular case of the $\ell_1$ penalty, these subproblems
can be solved quickly, exactly, and noniteratively by a soft-thresholding rule.
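The soft-thresholding rule mentioned above can be illustrated with a single update step. The following Python sketch is not the SPIRAL implementation of [141]; it is a minimal example of one separable quadratic subproblem, assuming a fixed Hessian scale `eta`, a nonnegativity constraint on the iterate, and a small `floor` on the predicted intensities to avoid division by zero (all three are assumptions of this sketch).

```python
def poisson_nll_gradient(A, alpha, f, floor=1e-12):
    """Gradient of sum_j [(A alpha)_j - f_j log (A alpha)_j] with respect to alpha."""
    mean = [sum(a * x for a, x in zip(row, alpha)) for row in A]
    resid = [1.0 - fj / max(mj, floor) for fj, mj in zip(f, mean)]
    return [
        sum(A[j][i] * resid[j] for j in range(len(A)))
        for i in range(len(alpha))
    ]

def soft_threshold_step(alpha, grad, eta, tau):
    """Minimize grad.(x - alpha) + (eta/2)||x - alpha||^2 + tau ||x||_1 over x >= 0.

    With the Hessian replaced by eta * I the problem separates across
    coordinates, and each coordinate has the closed-form solution below.
    """
    return [max(a - (g + tau) / eta, 0.0) for a, g in zip(alpha, grad)]

# Tiny example with a 2 x 2 identity sensing matrix (purely illustrative).
A = [[1.0, 0.0], [0.0, 1.0]]
alpha = [2.0, 4.0]
f = [2, 2]
grad = poisson_nll_gradient(A, alpha, f)
alpha_next = soft_threshold_step(alpha, grad, eta=2.0, tau=0.5)
```

In a full solver the scale `eta` would be adapted from iteration to iteration and the step accepted only if the objective decreases sufficiently; those details are omitted here.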

After reconstruction, we assess the estimate $\hat{\alpha}$ according to the normalized $\ell_1$ error
$\|\alpha^* - \hat{\alpha}\|_1 / \|\alpha^*\|_1$. We select the regularization weighting $\tau$ in the SPIRAL-$\ell_1$
algorithm to minimize this quantity for each randomly generated experiment indexed by
$(I, k, M)$. To assure that the results are not biased in our favor by only considering a
single random experiment for each $(I, k, M)$, we repeat this experiment several times.
The averaged reconstruction accuracy over 10 trials is presented in Figure 10.1.

These results show that the proposed method is able to accurately estimate sparse
signals when the signal intensity is sufficiently high; however, the performance of the
method degrades for lower signal strengths. More interesting is the behavior as we
vary the number of measurements. There is a clear phase transition where accurate
signal reconstruction becomes possible; beyond it, however, the performance gently degrades with
the number of measurements since there is a lower signal-to-noise ratio per
measurement. This effect is more pronounced at lower intensity levels, as we more quickly
enter the regime where only a few photons are collected per measurement. Both of
these results support the error bounds developed in Section 10.2.2.


10.3  Application:  Estimating  packet  arrival  rates 

This section describes an application of the pMLE estimator of Section 10.2: an
indirect approach for reconstructing average packet arrival rates and instantaneous
packet counts for a given number of streams (or flows) at a router in a
communication network, where the arrivals of packets in each flow are assumed to follow a
Poisson process. All packet counting must be done in hardware at the router, and
any hardware implementation must strike a delicate balance between speed,
accuracy, and cost. For instance, one could keep a dedicated counter for each flow, but,
depending on the type of memory used, one could end up with an implementation
that is either fast but expensive and unable to keep track of a large number of flows
(e.g., using SRAMs, which have low access times, but are expensive and physically
large) or cheap and high-density but slow (e.g., using DRAMs, which are cheap and
small, but have longer access times) [108, 180].

However, there is empirical evidence [109, 110] that flow sizes in IP networks follow
a power-law pattern: just a few flows (say, 10%) carry most of the traffic (say, 90%).
Based on this observation, several investigators have proposed methodologies for
estimating flows using a small number of counters by either (a) keeping track only of
the flows whose sizes exceed a given fraction of the total bandwidth (the approach
suggestively termed “focusing on the elephants, ignoring the mice”) [108] or (b) using
sparse random graphs to aggregate the raw packet counts and recovering flow sizes
using a message passing decoder [180].
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Relative  lj  Error  vs  Measurements  (10  trial  average) 


Number  of  Measurements  x  |  ()4 


Figure  10.1:  Average  performance  (as  measured  by  the  normalized  l\  error  ||a*  — 
a:||i/||a:*||i)  for  the  proposed  expander-based  observation  method  for  recovering 
sparse  signals  under  Poisson  noise.  In  this  experiment,  we  sweep  over  a  range  of 
measurements  and  consider  a  few  sparsity  ( k )  and  intensity  (/)  levels  of  the  true 
signal. 

We consider an alternative to these approaches based on Poisson CS, assuming that the underlying Poisson rate vector is sparse or approximately sparse; in fact, it is the approximate sparsity of the rate vector that mathematically describes the power-law behavior of the average packet counts. The goal is to maintain a compressed summary of the process sample paths using a small number of counters, such that it is possible to reconstruct both the total number of packets in each flow and the underlying rate vector. Since we are dealing here with Poisson streams, we would like to push the metaphor further and say that we are “focusing on the whales, ignoring the minnows.”

10.3.1 Problem Formulation


We wish to monitor a large number $N$ of packet flows using a much smaller number $M$ of counters. Each flow is a homogeneous Poisson process (cf. [33] for details pertaining to Poisson processes and networking applications). Specifically, let $\lambda^* \in \mathbb{R}_+^N$ denote the vector of rates, and let $U$ denote the random process $U = \{U_t\}_{t \in \mathbb{R}_+}$ with sample paths in $\mathbb{Z}_+^N$. In other words, for each $i \in \{1, \ldots, N\}$, the $i$th component of $U$, which we will denote by $U^{(i)}$, is a homogeneous Poisson process with the rate of $\lambda_i^*$ arrivals per unit time, and all the $U^{(i)}$'s are mutually conditionally independent given $\lambda^*$.

The goal is to estimate the unknown rate vector $\lambda^*$ based on $f$. We will focus on performance bounds for power-law network traffic, i.e., for $\lambda^*$ belonging to the class

\[ \Sigma_{\beta, L_0} \triangleq \left\{ \lambda \in \mathbb{R}_+^N : \|\lambda\|_1 = L_0,\ \sigma_k(\lambda) = O(k^{-\beta}) \right\} \tag{10.3.1} \]

for some $L_0 > 0$ and $\beta > 1$, where the constant hidden in the $O(\cdot)$ notation may depend on $L_0$. Here, $\beta$ is the power-law exponent that controls the tail behavior; in particular, the extreme regime $\beta \to +\infty$ describes the fully sparse setting. As in Section 10.2, we assume the total arrival rate $\|\lambda^*\|_1$ to be known (and equal to a given $L_0$) in advance, but this assumption can be easily dispensed with (cf. Remark 10.2).

As before, we evaluate each candidate estimator $\hat{\lambda} = \hat{\lambda}(f)$ based on its expected $\ell_1$ risk,

\[ R(\hat{\lambda}, \lambda^*) \triangleq \mathbb{E}_{\lambda^*} \|\hat{\lambda} - \lambda^*\|_1. \]
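The quantity $\sigma_k(\cdot)$ in (10.3.1) is the $\ell_1$ error of the best $k$-term approximation. The following minimal Python sketch shows how it can be computed; the function name `sigma_k` and the example rate profile $\lambda_i = i^{-2}$ are illustrative assumptions, not part of the thesis.

```python
def sigma_k(lam, k):
    """l1 error of the best k-term approximation of lam:
    the sum of all magnitudes except the k largest ones."""
    magnitudes = sorted((abs(v) for v in lam), reverse=True)
    return sum(magnitudes[k:])


# Hypothetical power-law profile: the i-th largest rate is i^(-2),
# so the tail sum behaves like k^(-1), i.e. beta = 1 in this toy case.
example_rates = [i ** -2.0 for i in range(1, 1001)]
tail_at_10 = sigma_k(example_rates, 10)  # strictly smaller than 1/10
```

For such a profile, the product $k\,\sigma_k(\lambda)$ stays bounded as $k$ grows, which is the $O(k^{-\beta})$ behavior the class definition asks for.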

10.3.2  Two  estimation  strategies 

We consider two estimation strategies. In both cases, we let our measurement matrix $\Phi$ be the adjacency matrix of the expander $G_{k,N}$ for a fixed $k < N/4$ (see Section 10.2.2 for definitions). The first strategy, which we call the direct method, uses standard expander-based CS to construct an estimate of $\lambda^*$. The second is the pMLE strategy, which relies on the machinery presented in Section 10.2 and can be used when only the rates are of interest.

The  direct  method 

In this method, which will be used as a “baseline” for assessing the performance of the pMLE, the counters are updated in discrete time, every $\tau$ time units. Let $x = \{x_\nu\}_{\nu \in \mathbb{Z}_+}$ denote the sampled version of $U$, where $x_\nu = U_{\nu\tau}$. The update takes place as follows. We have a binary matrix $\Phi \in \{0,1\}^{M \times N}$, and at each time $\nu$ let $f_\nu = \Phi x_\nu$. In other words, $f$ is obtained by passing a sampled $N$-dimensional homogeneous Poisson process with rate vector $\lambda^*$ through a linear transformation $\Phi$. We emphasize the fact that this observation model is not equivalent to sampling a Poisson process with rate $\Phi\lambda^*$.

The direct method uses expander-based CS to obtain an estimate $\hat{x}_\nu$ of $x_\nu$ from $f_\nu = \Phi x_\nu$, followed by letting

\[ \hat{\lambda}_\nu^{\mathrm{dir}} \triangleq \frac{(\hat{x}_\nu)_+}{\nu\tau}. \tag{10.3.2} \]

This strategy is based on the observation that $x_\nu/(\nu\tau)$ is the maximum-likelihood estimator of $\lambda^*$. To obtain $\hat{x}_\nu$, we need to solve the convex program

\[ \text{minimize } \|u\|_1 \quad \text{subject to } \Phi u = f_\nu \]

which can be cast as a linear program [29]. The resulting solution $\hat{x}_\nu$ may have negative coordinates,³ hence the use of the $(\cdot)_+$ operation in (10.3.2). We then have the following result:

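Theorem 10.9.

\[ R(\hat{\lambda}_\nu^{\mathrm{dir}}, \lambda^*) \le 4\sigma_k(\lambda^*) + \frac{\|(\lambda^*)^{1/2}\|_1}{\sqrt{\nu\tau}}, \tag{10.3.3} \]

where $(\lambda^*)^{1/2}$ is the vector with components $\sqrt{\lambda_i^*}$, $\forall i$.

Remark 10.10. Note that the error term in (10.3.3) is $O(1/\sqrt{\nu})$, assuming everything else is kept constant, which coincides with the optimal rate of the $\ell_1$ error decay in parametric estimation problems.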

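Proof. We first observe that, by construction, $\hat{x}_\nu$ satisfies the relations $\Phi\hat{x}_\nu = \Phi x_\nu$ and $\|\hat{x}_\nu\|_1 \le \|x_\nu\|_1$. Hence,

\[ \mathbb{E}\|\hat{x}_\nu - \nu\tau\lambda^*\|_1 \le \mathbb{E}\|\hat{x}_\nu - x_\nu\|_1 + \mathbb{E}\|x_\nu - \nu\tau\lambda^*\|_1 \le 4\,\mathbb{E}\,\sigma_k(x_\nu) + \mathbb{E}\|x_\nu - \nu\tau\lambda^*\|_1 \tag{10.3.4} \]

where the first step uses the triangle inequality, while the second step uses Proposition 9.1 with $\Delta = 0$. To bound the first term in (10.3.4), let $S \subset \{1, \ldots, N\}$ denote the positions of the $k$ largest entries of $\lambda^*$. Then, by definition of the best $k$-term representation,

\[ \sigma_k(x_\nu) \le \sum_{i \in S^c} x_{\nu,i}. \]

Therefore,

\[ \mathbb{E}\,\sigma_k(x_\nu) \le \mathbb{E}\sum_{i \in S^c} x_{\nu,i} = \nu\tau \sum_{i \in S^c} \lambda_i^* = \nu\tau\,\sigma_k(\lambda^*). \]

To bound the second term, we can use concavity of the square root, as well as the fact that each $x_{\nu,i} \sim \mathrm{Poisson}(\nu\tau\lambda_i^*)$, to write

\[ \mathbb{E}\|x_\nu - \nu\tau\lambda^*\|_1 = \mathbb{E}\sum_{i=1}^N |x_{\nu,i} - \nu\tau\lambda_i^*| = \mathbb{E}\sum_{i=1}^N \sqrt{(x_{\nu,i} - \nu\tau\lambda_i^*)^2} \le \sum_{i=1}^N \sqrt{\mathbb{E}(x_{\nu,i} - \nu\tau\lambda_i^*)^2} = \sum_{i=1}^N \sqrt{\nu\tau\lambda_i^*}. \]

Now, it is not hard to show that $\|(\hat{x}_\nu)_+ - \nu\tau\lambda^*\|_1 \le \|\hat{x}_\nu - \nu\tau\lambda^*\|_1$. Therefore,

\[ R(\hat{\lambda}_\nu^{\mathrm{dir}}, \lambda^*) \le \frac{\mathbb{E}\|\hat{x}_\nu - \nu\tau\lambda^*\|_1}{\nu\tau} \le 4\sigma_k(\lambda^*) + \frac{\|(\lambda^*)^{1/2}\|_1}{\sqrt{\nu\tau}}, \]

which proves the theorem.

³ Khajehnejad et al. [169] have recently proposed the use of perturbed adjacency matrices of expanders to recover nonnegative sparse signals.

□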
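The Jensen step in the proof, $\mathbb{E}\|x_\nu - \nu\tau\lambda^*\|_1 \le \sum_i \sqrt{\nu\tau\lambda_i^*}$, can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the helper names, the Knuth-style Poisson sampler, and the toy rate values are not from the thesis.

```python
import math
import random


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def mean_l1_deviation(rates, nu, tau, trials, rng):
    """Monte Carlo estimate of E||x_nu - nu*tau*lambda||_1 for independent Poisson counts."""
    total = 0.0
    for _ in range(trials):
        total += sum(abs(sample_poisson(nu * tau * r, rng) - nu * tau * r) for r in rates)
    return total / trials


def jensen_bound(rates, nu, tau):
    """Right-hand side of the bound: sum_i sqrt(nu * tau * lambda_i)."""
    return sum(math.sqrt(nu * tau * r) for r in rates)


rng = random.Random(0)
toy_rates = [0.5, 1.0, 2.0]          # hypothetical rate vector
estimate = mean_l1_deviation(toy_rates, nu=5, tau=1.0, trials=2000, rng=rng)
bound = jensen_bound(toy_rates, nu=5, tau=1.0)
```

In this toy setting the empirical deviation sits below the bound with a comfortable margin, since the mean absolute deviation of a Poisson variable is strictly smaller than its standard deviation.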


The  penalized  MLE  approach 


In the penalized MLE approach the counters are updated in a slightly different manner. Here the counters are still updated in discrete time, every $\tau$ time units; however, each counter $i \in \{1, \ldots, M\}$ is updated at times $\nu\tau + \frac{i}{M}\tau$, and only aggregates the packets that have arrived during the time period $\left[\nu\tau + \frac{i-1}{M}\tau,\ \nu\tau + \frac{i}{M}\tau\right)$. Therefore, in contrast to the direct method, here each arriving packet is registered by at most one counter. Furthermore, since the packets arrive according to a homogeneous Poisson process, conditioned on the vector $\lambda^*$, the values measured by distinct counters are independent.⁴ Therefore, the vector of counts at time $\nu$ obeys

\[ f_\nu \sim \mathrm{Poisson}(\Phi\alpha^*) \quad \text{where} \quad \alpha^* = \frac{\nu\tau d}{M}\lambda^*, \]

which is precisely the sensing model we have analyzed in Section 10.2.
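The time-multiplexed counter update described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes that each period of length $\tau$ is split into $M$ equal windows, one per counter, and the names `active_counter`, `register_packet`, and the `neighbors` map (the counters adjacent to each flow in the expander) are hypothetical.

```python
def active_counter(t, tau, num_counters):
    """1-based index of the single counter whose window contains time t.
    Counter i is assumed to cover [nu*tau + (i-1)*tau/M, nu*tau + i*tau/M)."""
    phase = (t % tau) / tau  # position inside the current period, in [0, 1)
    return min(int(phase * num_counters), num_counters - 1) + 1


def register_packet(counts, flow, t, tau, neighbors):
    """Increment the active counter if it is adjacent to the flow.
    Returns the counter index that registered the packet, or None."""
    counter = active_counter(t, tau, len(counts))
    if counter in neighbors[flow]:
        counts[counter - 1] += 1
        return counter
    return None


counts = [0, 0, 0, 0]
neighbors = {0: {1, 3}}  # hypothetical: flow 0 is wired to counters 1 and 3
first = register_packet(counts, 0, 0.1, 1.0, neighbors)   # falls in counter 1's window
second = register_packet(counts, 0, 0.3, 1.0, neighbors)  # counter 2 is active, not adjacent
```

Because exactly one counter is active at any instant, a packet is registered by at most one counter, which is the property the independence argument relies on.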

⁴ The independence follows from the fact that if $X_1, \ldots, X_M$ are conditionally independent random variables, then for any choice of functions $g_1, \ldots, g_M$, the random variables $g_1(X_1), \ldots, g_M(X_M)$ are also conditionally independent.

Now assume that the total average arrival rate $\|\lambda^*\|_1 = L_0$ is known. Let $\Lambda$ be a finite or a countable set of candidate estimators with $\|\lambda\|_1 \le L_0$ for all $\lambda \in \Lambda$, and let $\mathrm{pen}(\cdot)$ be a penalty functional satisfying the Kraft inequality over $\Lambda$. Given $\nu$ and $\tau$, consider the scaled set

\[ \Lambda_\nu \triangleq \frac{\nu\tau d}{M}\Lambda = \left\{ \frac{\nu\tau d}{M}\lambda : \lambda \in \Lambda \right\} \]

with the same penalty function, $\mathrm{pen}\!\left(\frac{\nu\tau d}{M}\lambda\right) = \mathrm{pen}(\lambda)$ for all $\lambda \in \Lambda$. We can now apply the results of Section 10.2. Specifically, let

\[ \hat{\lambda}_\nu^{\mathrm{pMLE}} \triangleq \frac{M}{\nu\tau d}\,\hat{\alpha}_\nu, \]

where $\hat{\alpha}_\nu$ is the corresponding pMLE estimator obtained according to (10.2.4). The following theorem is a consequence of Theorem 10.3 and the remark following it:

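Theorem 10.11. If the set $\Lambda$ satisfies the strict positivity condition (10.2.7), then there exists some absolute constant $C > 0$, such that

\[ R(\hat{\lambda}_\nu^{\mathrm{pMLE}}, \lambda^*) \le 4\sigma_k(\lambda^*) + C \min_{\lambda \in \Lambda} \sqrt{ k\log(N/k)\,\|\lambda - \lambda^*\|_1^2 + \frac{k L_0\,\mathrm{pen}(\lambda)}{\nu\tau} }. \tag{10.3.5} \]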


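We now develop risk bounds under the power-law condition. To this end, let us suppose that $\lambda^*$ is a member of the power-law class $\Sigma_{\beta, L_0}$ defined in (10.3.1). Fix a small positive number $\delta$, such that $L_0/\sqrt{\delta}$ is an integer, and define the set

\[ \Lambda \triangleq \left\{ \lambda \in \mathbb{R}_+^N : \|\lambda\|_1 \le L_0;\ \lambda_i \in \{ s\sqrt{\delta} \}_{s=0}^{L_0/\sqrt{\delta}},\ \forall i \right\}. \]

These will be our candidate estimators of $\lambda^*$. We can define the penalty function $\mathrm{pen}(\lambda) \triangleq \|\lambda\|_0 \log(\delta^{-1})$. For any $\lambda \in \Sigma_{\beta, L_0}$ and any $1 \le r \le N$ we are able to find some $\lambda^{(r)} \in \Lambda$, such that $\|\lambda^{(r)}\|_0 \asymp r$ and

\[ \|\lambda - \lambda^{(r)}\|_1^2 \asymp r^{-2\beta} + r\delta. \]

Here we assume that $\delta$ is sufficiently small, so that the penalty term $\frac{k\,r\log(\delta^{-1})}{\nu\tau}$ dominates the quantization error $r\delta$. In order to guarantee that the penalty function satisfies Kraft's inequality, we need to ensure that

\[ \sum_{r=1}^{N} \sum_{\lambda^{(r)} \in \Lambda:\ \|\lambda^{(r)}\|_0 = r} e^{-\mathrm{pen}(\lambda^{(r)})} \le 1. \]

For every fixed $r$, there are exactly $\binom{N}{r}$ subspaces of dimension $r$, and each subspace contains exactly $(L_0/\sqrt{\delta})^r$ distinct elements of $\Lambda$. Therefore, if

\[ \delta \le (2NL_0)^{-2}, \tag{10.3.6} \]

then

\[ \sum_{r=1}^{N} \binom{N}{r} \left(\frac{L_0}{\sqrt{\delta}}\right)^{r} \delta^{r} \le \sum_{r=1}^{N} \left(N L_0 \sqrt{\delta}\right)^{r} \le \sum_{r=1}^{\infty} 2^{-r} \le 1, \]

and Kraft's inequality is satisfied.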

Using the fact that $k\log(N/k) = O(kd)$, we can bound the minimum over $\lambda \in \Lambda$ in (10.3.5) from above by


Ur-™  +  rtl°E(l5~1) 


nun 

l<r<N 


We  can  now  particularize  Theorem  10.11  to  the  power-law  case: 

Theorem  10.12. 


where the constants implicit in the $O(\cdot)$ notation depend on $L_0$ and $\beta$.

Note that the risk bound here is slightly worse than the benchmark bound of Theorem 10.9. However, as we will see in Section 10.3.3, the pMLE approach obtains higher empirical accuracy.
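The Kraft-inequality argument behind condition (10.3.6) can be checked numerically. The sketch below assumes the counting used in that argument, namely $\binom{N}{r}$ supports of size $r$, at most $(L_0/\sqrt{\delta})^r$ grid points per support, and weight $e^{-\mathrm{pen}(\lambda)} = \delta^r$; the function name `kraft_sum` and the toy values of $N$ and $L_0$ are illustrative assumptions.

```python
import math


def kraft_sum(num_flows, l0, delta):
    """Upper bound on sum over Lambda of exp(-pen(lambda)):
    sum_r C(N, r) * (L0 / sqrt(delta))^r * delta^r."""
    per_coordinate = l0 * math.sqrt(delta)  # (L0 / sqrt(delta)) * delta
    return sum(math.comb(num_flows, r) * per_coordinate ** r
               for r in range(1, num_flows + 1))


num_flows, l0 = 50, 2.0
delta_ok = (2 * num_flows * l0) ** -2       # the largest delta allowed by (10.3.6)
within_budget = kraft_sum(num_flows, l0, delta_ok)
too_coarse = kraft_sum(num_flows, l0, 0.01)  # a delta violating (10.3.6)
```

With $\delta = (2NL_0)^{-2}$ the sum equals $(1 + 1/(2N))^N - 1$, which stays below $e^{1/2} - 1 < 1$ for every $N$, whereas a coarser quantization step can push the sum far above one.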

10.3.3  Experimental  Results 

Here we compare penalized MLE with $\ell_1$-magic [52], a universal $\ell_1$ minimization method, and with SSMP [30], an alternative method that employs combinatorial optimization. $\ell_1$-magic and SSMP both compute the “direct” estimator. For ease of computation, the candidate set $\Lambda$ is approximated by the convex set of all positive vectors with bounded $\ell_1$ norm, and the CVX package [134, 133] is used to directly solve the pMLE objective function with $\mathrm{pen}(\theta) = \|\theta\|_1$.

Figures 10.2(a) through 10.4(b) report the results of numerical experiments, where the goal is to identify the $k$ largest entries in the rate vector from the measured data. Since a random graph is, with overwhelming probability, an expander graph, each experiment was repeated 30 times using independent sparse random graphs with $d = 8$.

We used the following process to generate the rate vector. First, given the power-law exponent $\beta$, the magnitudes of the $k$ whales were chosen according to a power-law distribution with parameter $\beta$. The positions of the $k$ whales were then chosen uniformly at random. Finally, the $N - k$ minnows were sampled independently from a $\mathcal{N}(0, 10^{-6})$ distribution. Figure 10.2 shows the relative $\ell_1$ error ($\|\hat{\lambda} - \lambda^*\|_1/\|\lambda^*\|_1$) of the three algorithms as a function of $k$. Note that in all cases ($\beta = 1$, $\beta = 1.5$, and $\beta = 2$) the pMLE algorithm provides lower $\ell_1$ errors. Similarly, Figure 10.3 reports the probabilities of exact recovery as a function of $k$. Again, in all three cases the pMLE algorithm has a higher probability of exact support recovery than the two direct algorithms. We also analyzed the impact of changing the number of updates on the accuracy of the three algorithms. The results are shown in Figure 10.4. Here we fixed the number of whales to $k = 30$, and changed the number of updates from 10 to 200. As the number of updates $\nu$ increases, the relative $\ell_1$ errors of all three algorithms decrease and their probabilities of exact support recovery consistently increase. Moreover, the pMLE algorithm always outperforms the $\ell_1$-magic (LP) and SSMP algorithms.

Figure 10.2: Relative $\ell_1$ error as a function of number of whales $k$, for $\ell_1$-magic (LP), SSMP and pMLE for different choices of the power-law exponent $\beta$. The number of flows is $N = 5000$, the number of counters is $M = 800$, and the number of updates is 40.

Figure 10.3: Probability of successful support recovery as a function of number of whales $k$, for $\ell_1$-magic (LP), SSMP and pMLE for different choices of the power-law exponent $\beta$. The number of flows is $N = 5000$, the number of counters is $M = 800$, and the number of updates is 40.

Figure 10.4: Performance of the $\ell_1$-magic, SSMP and pMLE algorithms as a function of the number of updates $\nu$: (a) relative $\ell_1$ error as a function of number of updates $\nu$; (b) probability of successful support recovery as a function of number of updates $\nu$. The number of flows is $N = 5000$, the number of counters is $M = 800$, and the number of whales is $k = 30$. There are $k$ whales whose magnitudes are assigned according to a power-law distribution with $\beta = 1$, and the remaining entries are minnows with magnitudes determined by a $\mathcal{N}(0, 10^{-6})$ random variable.
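The rate-vector generation described in this section can be sketched as follows. This is a hedged illustration rather than the code used for the experiments: it assumes that "power-law distribution with parameter $\beta$" means a Pareto distribution with shape $\beta$, that $10^{-6}$ is the variance of the minnow distribution, and the function name `generate_rate_vector` is hypothetical. Minnow values keep whatever sign the Gaussian draw produces.

```python
import random


def generate_rate_vector(num_flows, num_whales, beta, rng):
    """Return (rates, whale_positions) for one synthetic experiment."""
    # Whale positions are chosen uniformly at random, without repetition.
    positions = rng.sample(range(num_flows), num_whales)
    # Minnows: N(0, 1e-6), i.e. standard deviation 1e-3 (assumed reading).
    rates = [rng.gauss(0.0, 1e-3) for _ in range(num_flows)]
    # Whales: Pareto-distributed magnitudes with shape beta (always >= 1).
    for pos in positions:
        rates[pos] = rng.paretovariate(beta)
    return rates, sorted(positions)


rng = random.Random(1)
rates, whales = generate_rate_vector(num_flows=5000, num_whales=30, beta=1.5, rng=rng)
```

With these settings the whales are separated from the minnows by roughly three orders of magnitude, so the $k$ largest entries coincide with the whale positions.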


Part IV

Optimal Model-Selection via the Reed-Muller Frames

Chapter 11

Two Fundamental Measures of Coherence and Their Role in Model Selection

11.1  What  is  Model  Selection? 

11.1.1  Background 

In compressed sensing, and in many other information processing and statistics problems involving high-dimensional data, the curse of dimensionality can often be broken by exploiting the fact that real-world data tend to live on low-dimensional manifolds. This phenomenon is exemplified by the important special case in which a data vector $\alpha^* \in \mathbb{R}^N$ satisfies $\|\alpha^*\|_0 \triangleq \sum_{i=1}^{N} 1_{\{|\alpha_i^*| > 0\}} \le k \ll N$ and is observed according to the linear measurement model $f = \Phi\alpha^* + e_M$. Here, $\Phi$ is an $M \times N$ (real- or complex-valued) matrix called the sensing or design matrix, while $e_M \in \mathbb{R}^M$ represents noise in the measurement system.

Fundamentally, given a measurement vector $f = \Phi\alpha^* + e_M$ in the compressed setting, there are three complementary, but nonetheless distinct, questions that might be asked.

[Sparse Approximation] Under what conditions can we obtain a reliable estimate of a $k$-sparse $\alpha^*$ from $f$?

[Regression] Under what conditions can we reliably approximate $\Phi\alpha^*$ corresponding to a $k$-sparse $\alpha^*$ from $f$?

[Model Selection] Under what conditions can we reliably recover the locations of the nonzero entries of a $k$-sparse $\alpha^*$ (in other words, the model $S \triangleq \{i \in \{1, \ldots, N\} : |\alpha_i^*| > 0\}$) from $f$?

Algorithm 8 The One-Step Thresholding (OST) Algorithm for Model Selection
Input: An $M \times N$ matrix $\Phi$, a vector $f \in \mathbb{C}^M$, and a threshold $\lambda > 0$
Output: An estimate $\hat{S} \subset \{1, \ldots, N\}$ of the true model $S$

$\tilde{\alpha} \leftarrow \Phi^{\mathrm{H}} f$ {Form signal proxy}

$\hat{S} \leftarrow \{i \in \{1, \ldots, N\} : |\tilde{\alpha}_i| > \lambda\}$ {Select model via OST}


In  Parts  II  and  III  of  this  thesis,  we  focused  on  efficient  algorithms  for  (approximately) 
solving  the  sparse  approximation  and  the  regression  problems.  In  many  application 
areas,  however,  the  model-selection  question  is  equally — if  not  more — important  than 
the  other  two  questions.  In  particular,  the  problem  of  model  selection  (sometimes 
also known as variable selection or sparsity pattern recovery) arises indirectly in a
number  of  contexts,  such  as  subset  selection  in  linear  regression  [193],  estimation  of 
structures  in  graphical  models  [192],  and  signal  denoising  [73].  In  addition,  solving 
the  model-selection  problem  in  some  (but  not  all)  cases  also  enables  one  to  solve  the 
sparse  approximation  and/or  the  regression  problem. 
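The two steps of Algorithm 8 (form the signal proxy, then threshold it) translate directly into code. The following minimal Python sketch uses plain lists; the function name `one_step_thresholding`, the row-major matrix layout, and the small example matrix are illustrative assumptions.

```python
def one_step_thresholding(phi, f, threshold):
    """phi: list of M rows, each with N real or complex entries; f: length-M vector.
    Returns the estimated model as a set of 1-based column indices."""
    num_rows, num_cols = len(phi), len(phi[0])
    # Signal proxy: conjugate transpose of phi applied to f.
    proxy = [sum(phi[m][i].conjugate() * f[m] for m in range(num_rows))
             for i in range(num_cols)]
    # Keep every index whose proxy magnitude exceeds the threshold.
    return {i + 1 for i, z in enumerate(proxy) if abs(z) > threshold}


s = 2 ** -0.5
phi = [[1, 0, 0, s],
       [0, 1, 0, s],
       [0, 0, 1, 0]]           # unit-norm columns; column 4 is correlated with columns 1 and 2
f = [2.0, 0.0, 0.0]            # noiseless measurement of a signal supported on index 1
selected_strict = one_step_thresholding(phi, f, 1.5)
selected_loose = one_step_thresholding(phi, f, 1.0)
```

The example hints at why the threshold matters: a correlated column leaks energy into the proxy, so a threshold set too low admits a false index.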


11.1.2  Main  Contributions 


Model Selection: One of the primary objectives of this chapter is to study the problem of polynomial time, model-order agnostic model selection in a compressed setting for the general case of arbitrary (random or deterministic) design matrices and arbitrary nonzero entries of the signal. In order to accomplish this task, we introduce two fundamental measures of coherence among the (normalized) columns $\{\varphi_i \in \mathbb{C}^M\}$ of the $M \times N$ design matrix $\Phi$, namely,¹

• Worst-Case Coherence: $\mu(\Phi) \triangleq \max_{i,j:\ i \ne j} |\langle \varphi_i, \varphi_j \rangle|$, and

• Average Coherence: $\nu(\Phi) \triangleq \frac{1}{N-1} \max_{i} \left| \sum_{j:\ j \ne i} \langle \varphi_i, \varphi_j \rangle \right|$.


Roughly  speaking,  worst-case  coherence — which  has  been  introduced  in  Section  3.4.2 — 
is  a  similarity  measure  between  the  columns  of  a  design  matrix:  the  smaller  the  worst- 
case  coherence,  the  less  similar  the  columns.  On  the  other  hand,  average  coherence  is 
a  measure  of  the  spread  of  the  columns  of  a  design  matrix  within  the  M-dimensional 
unit  ball:  the  smaller  the  average  coherence,  the  more  spread  out  the  column  vectors. 
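Both coherence measures are simple functions of the off-diagonal entries of the Gram matrix. The sketch below computes them for a small matrix stored as a list of rows; the function name `coherences` and the two example matrices are illustrative assumptions, and the columns are assumed to have unit norm already.

```python
def coherences(phi):
    """Return (mu, nu): worst-case and average coherence of the columns of phi.
    phi is a list of M rows with N real or complex entries and unit-norm columns."""
    num_rows, num_cols = len(phi), len(phi[0])

    def inner(i, j):
        return sum(phi[m][i].conjugate() * phi[m][j] for m in range(num_rows))

    mu, nu = 0.0, 0.0
    for i in range(num_cols):
        off_diagonal_sum = 0
        for j in range(num_cols):
            if j != i:
                g = inner(i, j)
                mu = max(mu, abs(g))       # largest pairwise correlation
                off_diagonal_sum += g      # signed sum, so cancellations count
        nu = max(nu, abs(off_diagonal_sum))
    return mu, nu / (num_cols - 1)


s = 2 ** -0.5
mu_a, nu_a = coherences([[1.0, 0.0, s], [0.0, 1.0, s]])    # third column leans toward both
mu_b, nu_b = coherences([[1.0, 0.0, s], [0.0, 1.0, -s]])   # correlations partly cancel
```

The second example has the same worst-case coherence as the first but half the average coherence, because the inner products with the third column have opposite signs.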

Our main contribution in the area of model selection is that we make use of these two measures of coherence to propose and analyze a model-order agnostic threshold for the one-step thresholding (OST) algorithm (see Algorithm 8) for model selection. Specifically, we characterize in Section 11.2 both the exact and the partial model-selection performance of OST in a non-asymptotic setting in terms of $\mu$ and $\nu$. In particular, we establish in Section 11.2 that if $\mu(\Phi) \asymp M^{-1/2}$ and $\nu(\Phi) \precsim M^{-1}$ then OST, despite being computationally primitive, can perform near-optimally for the case when either (i) the energy of any nonzero entry of $\alpha^*$ is not too far away from the average signal energy per nonzero entry $\|\alpha^*\|_2^2/k$ or (ii) the signal-to-noise ratio (SNR) in the measurement system is not too high. Equally importantly, in contrast to some of the existing literature on model selection, this analysis holds for arbitrary values of the nonzero entries of $\alpha^*$ and it does not require the $M \times k$ submatrices of the design matrix $\Phi$ to have full column rank.

¹ Here, and throughout the rest of this chapter, we assume without loss of generality that $\Phi$ has unit $\ell_2$-norm columns. This is because deviations from this assumption can always be accounted for by appropriately scaling the entries of the data vector $\alpha^*$ instead.

11.1.3  Relationship  to  Previous  Work 

The  problems  of  model  selection  and  sparse-signal  recovery  in  general  and  the  use  of 
OST  (also  known  as  simple  thresholding  [90]  and  marginal  regression  [120])  to  solve 
these  problems  in  particular  have  a  rich  history  in  the  literature.  In  the  context 
of model selection in the compressed setting, Mallows' $C_p$ selection procedure [187]
and  the  Akaike  information  criterion  (AIC)  [3] — both  of  which  essentially  attempt  to 
solve  a  complexity-regularized  version  of  the  least-squares  criterion — are  considered  to 
be  seminal  works,  and  are  known  to  perform  well  empirically  as  well  as  theoretically; 
see,  e.g.,  [188]  and  the  references  therein.  These  two  procedures  have  been  modified 
by  numerous  researchers  over  the  years  in  order  to  improve  their  performance — the 
most  notable  variants  being  the  Bayesian  information  criterion  (BIC)  [224]  and  the 
risk inflation criterion (RIC) [115]. Solving model-selection procedures such as $C_p$,
AIC,  BIC,  and  RIC,  however,  is  known  to  be  an  NP-hard  problem  [197]  even  if  the 
true  model  order  k  is  made  available  to  these  procedures. 

In order to overcome the computational intractability of these model-selection procedures, several methods based on convex optimization have been proposed by various researchers in recent years. Among these proposed methods, the LASSO [233] has arguably become the standard tool for model selection, which can be partly attributed to the theoretical guarantees provided for the LASSO in [192, 261, 247, 50]. In particular, the results reported in [192, 261] establish that the LASSO asymptotically identifies the correct model under certain conditions on the design matrix $\Phi$ and the sparse vector $\alpha^*$. Later, Wainwright in [247] strengthens the results of [192, 261] and makes explicit the dependence of exact model selection using the LASSO on the smallest (in magnitude) nonzero entry of $\alpha^*$. However, apart from the fact that the results reported in [192, 261, 247, 221] are for exact model selection and are only asymptotic in nature, the main limitation of these works is that explicit verification of the conditions (such as the irrepresentable condition of [261] and the incoherence condition of [247]) that a generic design matrix needs to satisfy is computationally intractable for $k \succsim \mu^{-1}$. The most general (and non-asymptotic) model-selection results using the LASSO for arbitrary design matrices have been reported in [50]. Specifically, Candes and Plan have established in [50] that the LASSO correctly identifies most models with probability $1 - O(N^{-1})$ under certain conditions on the smallest nonzero entry of $\alpha^*$ provided: (i) the spectral norm (the largest singular value) and the worst-case coherence of $\Phi$ are not too large, and (ii) the values of the nonzero entries of $\alpha^*$ are independent and statistically symmetric around zero. Despite these recent theoretical triumphs of the LASSO, it is still desirable to study alternative solutions to the problem of polynomial time, model-order agnostic model selection in a compressed setting. This is because

1. LASSO solves a detection problem by solving a (more complicated) estimation problem.

2. LASSO requires the minimum singular values of the submatrices of $\Phi$ corresponding to the true models to be bounded away from zero [192, 261, 247, 50]. While this is a plausible condition for the case when one is interested in estimating $\alpha^*$, it is arguable whether this condition is necessary for the case of model selection.

3. The current literature on model selection using the LASSO lacks guarantees beyond $k \succsim \mu^{-1}$ for the case of generic design matrices and arbitrary nonzero entries. In particular, given an arbitrary design matrix, [192, 261, 247, 50] do not provide any guarantees beyond $k \succsim \sqrt{M}$ for even the simple case of $\alpha^* \in \mathbb{R}_+^N$.

4. The computational complexity of the LASSO for generic design matrices tends to be $O(N^3)$ [120]. This makes the LASSO computationally demanding for large-scale model-selection problems.

Recently,  a  few  researchers  have  raised  somewhat  similar  concerns  about  the  LASSO 
and  revisited  the  much  older  (and  oft-forgotten)  method  of  thresholding  for  model 
selection  [223,  113,  214,  120],  which  has  computational  complexity  of  only  one  matrix- 
vector  multiplication.  Algorithmically,  this  makes  our  approach  to  model  selection 
similar  to  that  of  [223,  113,  214,  120].  Nevertheless,  the  OST  algorithm  presented  in 
this  chapter  differs  from  [223,  113,  214,  120]  in  five  key  aspects: 

1.  Model-Order  Agnostic  Model  Selection:  Unlike  [223,  113,  214,  120],  the  OST 
algorithm  presented  in  this  chapter  is  completely  agnostic  to  both  the  true 
model  order  k  and  any  estimate  of  k. 

2. Generic Design Matrices and Arbitrary Nonzero Entries: The results reported in this chapter hold for arbitrary (random or deterministic) design matrices and do not assume any statistical prior on the values of the nonzero entries of $\alpha^*$ even when $k$ scales linearly with $M$. In contrast, [113] only studies the problem of Gaussian design matrices whereas the most influential results reported in [223, 214, 120] assume that the values of the nonzero entries of $\alpha^*$ are independent and statistically symmetric around zero.

3. Verifiable Sufficient Conditions: In contrast to [223, 113, 214, 120], we relate the model-selection performance of OST to two global parameters of $\Phi$, namely, $\mu$ and $\nu$, which are trivially computable in polynomial time: $\mu(\Phi) = \|\Phi^{\mathrm{H}}\Phi - I\|_{\max}$ and $\nu(\Phi) = \frac{1}{N-1}\|(\Phi^{\mathrm{H}}\Phi - I)\mathbf{1}\|_{\infty}$.

4. Non-Asymptotic Theory: Similar to [113, 214, 120], the analysis in this chapter can be used to establish that OST achieves (asymptotically) consistent model selection under certain conditions. However, the results reported in this chapter are completely non-asymptotic in nature (with explicit constants) and thereby shed light on the rate at which OST achieves consistent model selection.

5. Partial Model Selection: In addition to the exact model-selection performance of OST, we also characterize in the chapter its partial model-selection performance. In this regard, we establish that the universal threshold proposed in Section 11.2 for OST guarantees $\hat{S} \subset S$ with high probability and we quantify the cardinality of the estimate $\hat{S}$. On the other hand, both [223] and [113] study only exact model selection, whereas [120, 214] study approximate (though not partial) model selection only for Gaussian design matrices [120] and assuming Gaussian (resp. statistical) priors on the nonzero entries of $\alpha^*$ [214] (resp. [120]).


11.2  Model  Selection  Using  One-Step  Threshold¬ 
ing 

11.2.1  Assumptions 

Before presenting our results on model selection using OST, we need to be mathematically precise about our problem formulation. To this end, we begin by reconsidering the measurement model $f = \Phi\alpha^* + e_M$ and assume that $\Phi$ is an $M \times N$ real- or complex-valued design matrix having unit $\ell_2$-norm columns, $\alpha^*$ is a $k$-sparse signal ($\|\alpha^*\|_0 \le k$), and $k < M < N$. Here, we allow $\Phi$ to be either a random or a deterministic design matrix, while we take $e_M$ to be a complex additive white Gaussian noise vector. It is worth mentioning that Gaussianity of $e_M$ is just a simplifying assumption for the sake of this exposition; in particular, the results presented in this section are readily generalizable to other noise distributions as well as perturbations having bounded $\ell_2$-norms. Finally, the main assumption that we make here is that the true model $S = \{i \in \{1, \ldots, N\} : |\alpha_i^*| > 0\}$ is a uniformly random $k$-subset of $\{1, \ldots, N\}$. In other words, we have a uniform prior on the support of the data vector $\alpha^*$.

Figure 11.1: A Venn diagram used to illustrate the major difference between the BP-based recovery guarantees and the OST-based recovery guarantees for $k$-sparse unimodal signals in $\mathbb{R}^N$ measured using Alltop Gabor frames. The OST algorithm is guaranteed to recover $\alpha^* \in \Sigma_1 - \Gamma_B$. But BP, unlike OST, is only guaranteed to recover $\alpha^* \in \Sigma_2$ in this case. [Diagram regions: $\Sigma_1$ = space of all $k$-sparse unimodal signals in $\mathbb{R}^N$ such that $k \precsim M/\log N$; $\Sigma_2$ = set of all $k$-sparse signals in $\Sigma_1$ for which $k \precsim \sqrt{M}$; $\Gamma_B$ = set of signals in $\Sigma_1 - \Sigma_2$ supported on “bad” subsets.]

11.2.2  Main  Results 

Intuitively  speaking,  successful  model  selection  requires  the  columns  of  the  design 
matrix  to  be  incoherent.  In  the  case  of  the  LASSO,  this  notion  of  incoherence  has 
been  quantified  in  [261]  and  [247]  in  terms  of  the  irrepresentable  condition  and  the 
incoherence  condition,  respectively  (see  also  [50]).  In  contrast  to  earlier  work  on  model 
selection,  however,  we  formulate  this  idea  of  incoherence  in  terms  of  the  coherence 
property. 

Definition 11.1 (The Coherence Property). An $M \times N$ design matrix $\Phi$ having unit $\ell_2$-norm columns is said to obey the coherence property if the following two conditions hold:

\[ \mu(\Phi) \le \frac{0.1}{\sqrt{2\log N}}, \tag{CP-1} \]

and

\[ \nu(\Phi) \le \frac{\mu}{\sqrt{M}}. \tag{CP-2} \]

In words, (CP-1) roughly states that the columns of $\Phi$ are not too similar, while (CP-2) roughly states that the columns of $\Phi$ are somewhat distributed within the $M$-dimensional unit ball. Note that the coherence property is superior to other measures of incoherence such as the irrepresentable condition in two key aspects. First, it does not require the singular values of the submatrices of $\Phi$ to be bounded away from zero. Second, it can be easily verified in polynomial time since it simply requires checking that

\[ \|\Phi^{\mathrm{H}}\Phi - I\|_{\max} \le (200\log N)^{-1/2} \quad \text{and} \quad \|(\Phi^{\mathrm{H}}\Phi - I)\mathbf{1}\|_{\infty} \le (N-1)M^{-1/2}\,\|\Phi^{\mathrm{H}}\Phi - I\|_{\max}, \]

where $\|\cdot\|_{\max}$ denotes the largest entry in magnitude.
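Once $\mu$ and $\nu$ have been computed, checking the coherence property amounts to two scalar comparisons. The sketch below encodes (CP-1) as $\mu \le 0.1/\sqrt{2\log N}$ and (CP-2) as $\nu \le \mu/\sqrt{M}$, which is how the conditions are read here; the function name `satisfies_coherence_property` and the numerical values are illustrative assumptions.

```python
import math


def satisfies_coherence_property(mu, nu, num_rows, num_cols):
    """Check (CP-1) and (CP-2) for an M x N design matrix with
    worst-case coherence mu and average coherence nu."""
    cp1 = mu <= 0.1 / math.sqrt(2.0 * math.log(num_cols))
    cp2 = nu <= mu / math.sqrt(num_rows)
    return cp1 and cp2


# Hypothetical values for a 2500 x 1000 matrix; the CP-1 limit is about 0.0269.
ok = satisfies_coherence_property(mu=0.02, nu=0.0003, num_rows=2500, num_cols=1000)
too_coherent = satisfies_coherence_property(mu=0.05, nu=0.0003, num_rows=2500, num_cols=1000)
poorly_spread = satisfies_coherence_property(mu=0.02, nu=0.001, num_rows=2500, num_cols=1000)
```

Neither comparison involves submatrices or singular values, which is what makes the property cheap to verify compared with conditions such as the irrepresentable condition.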


Below,  we  describe  the  implications  of  the  coherence  property  for  both  the  exact  and 
the  partial  model-selection  performance  of  OST.  Before  proceeding  further,  however, 
it  is  instructive  to  first  define  some  fundamental  quantities  pertaining  to  the  problem 
of  model  selection  as  follows: 


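\[ \|\alpha\|_{\min} := \min_{i \in S} |\alpha^*_i|, \qquad
   \mathrm{MAR} := \frac{\|\alpha\|_{\min}^2}{\|\alpha^*\|_2^2 / k}, \]

\[ \mathrm{SNR}_{\min} := \frac{\|\alpha\|_{\min}^2}{\mathbb{E}[\|e_M\|_2^2] / k}, \qquad
   \mathrm{SNR} := \frac{\|\alpha^*\|_2^2}{\mathbb{E}[\|e_M\|_2^2]}. \]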


In words, $\|\alpha\|_{\min}$ is the magnitude of the smallest nonzero entry of $\alpha^*$, while MAR,
which is termed the minimum-to-average ratio [113], is the ratio of the energy in
the smallest nonzero entry of $\alpha^*$ and the average signal energy per nonzero entry of
$\alpha^*$. Likewise, $\mathrm{SNR}_{\min}$ is the ratio of the energy in the smallest nonzero entry of $\alpha^*$
and the average noise energy per nonzero entry, while SNR simply denotes the usual
signal-to-noise ratio in the system. It is easy to see that $\mathrm{SNR}_{\min} = \mathrm{SNR} \cdot \mathrm{MAR}$. We are
now ready to state the first main result of this chapter that concerns the performance
of OST in terms of exact model selection.
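As a quick numerical illustration of these quantities, the following sketch computes them for a given sparse vector. It assumes the expected noise energy $\mathbb{E}[\|e_M\|_2^2]$ is supplied by the caller (for white noise of variance $\sigma^2$ per entry this would be $M\sigma^2$); the function name and the dictionary keys are hypothetical.

```python
def model_selection_quantities(alpha, noise_energy):
    """Compute ||alpha||_min, MAR, SNR_min and SNR.

    alpha        -- entries of alpha* (zeros mark entries outside the model S)
    noise_energy -- E[||e_M||_2^2], supplied by the caller (assumption)
    """
    nonzero = [abs(a) for a in alpha if a != 0]
    k = len(nonzero)                       # model order
    alpha_min = min(nonzero)               # smallest nonzero magnitude
    signal_energy = sum(v * v for v in nonzero)
    return {
        "alpha_min": alpha_min,
        "mar": alpha_min ** 2 / (signal_energy / k),
        "snr_min": alpha_min ** 2 / (noise_energy / k),
        "snr": signal_energy / noise_energy,
    }
```

For example, with nonzero entries 3 and -4 and noise energy 5, the signal energy is 25, so MAR = 9 / 12.5 = 0.72, SNR = 5, and SNR_min = 9 / 2.5 = 3.6 = SNR x MAR.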


Theorem 11.2 (Exact Model Selection Using OST). Suppose that the design matrix
$\Phi$ satisfies the coherence property and let $e_M$ be distributed as $\mathcal{N}(0, \sigma^2 I)$. Next,
choose the threshold

\[ \lambda = \max\left\{ \frac{1}{t}\,10\mu\sqrt{M \cdot \mathrm{SNR}},\ \frac{1}{1-t}\sqrt{2} \right\} \sqrt{2\sigma^2 \log N} \]

for any $t \in (0,1)$. Then, if we write $\mu(\Phi)$ as $\mu = c_1 M^{-1/\gamma}$ for some $c_1 > 0$ (which may
depend on $N$) and $\gamma \in \{0\} \cup [2, \infty)$, the OST algorithm (Algorithm 8) satisfies
$\Pr(\hat{S} \ne S) \le 6N^{-1}$ provided $N \ge 128$ and the number of measurements satisfies

\[ M > \max\left\{ 2k\log N,\ \frac{c_2 k \log N}{\mathrm{SNR}_{\min}},\ \left(\frac{c_3 k \log N}{\mathrm{MAR}}\right)^{\gamma/2} \right\}
     = \max\left\{ 2k\log N,\ \frac{c_2 k \log N}{\mathrm{SNR} \cdot \mathrm{MAR}},\ \left(\frac{c_3 k \log N}{\mathrm{MAR}}\right)^{\gamma/2} \right\}. \tag{11.2.1} \]

Here, the quantities $c_2, c_3 > 0$ are defined as $c_2 = 16(1-t)^{-2}$ and $c_3 = 800\,c_1^2\,t^{-2}$, while
the probability of failure is with respect to the true model $S$ and the noise vector $e_M$.


The proof of this theorem is provided in Section 11.3. Note that the parameter $t$ in
Theorem 11.2 can always be fixed a priori (say $t = 1/2$) without affecting the scaling
relation in (11.2.1). In practice, however, $t$ should be chosen so as to reduce the total
number of measurements needed to ensure successful model selection; the optimal
choice of $t$ in this regard is
$t_{\mathrm{opt}} = \arg\min_t \left( \max\left\{ \frac{c_2 k \log N}{\mathrm{SNR}_{\min}},\ \left(\frac{c_3 k \log N}{\mathrm{MAR}}\right)^{\gamma/2} \right\} \right)$.
Notice also that Theorem 11.2 is best suited for applications where one is interested in
quantifying the minimum number of measurements needed to guarantee exact model
selection for a given class of signals. Alternatively, it might be the case in some other
applications that the problem dimensions are fixed and one is instead interested in
specifying the class of signals that leads to successful model selection. The following
variant of Theorem 11.2 is best suited in such situations.
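The threshold and the measurement bound of Theorem 11.2 are simple closed-form expressions, so the trade-off in $t$ can be explored numerically. The sketch below is illustrative only: the function names are hypothetical, the case $\gamma = 0$ is not handled (the code assumes $\gamma \ge 2$), and the grid search is merely one simple way to approximate the minimizing $t$.

```python
import math

def ost_threshold(mu, M, N, snr, sigma2, t):
    """lambda = max{(1/t) 10 mu sqrt(M SNR), sqrt(2)/(1-t)} * sqrt(2 sigma^2 log N)."""
    scale = math.sqrt(2 * sigma2 * math.log(N))
    return max(10 * mu * math.sqrt(M * snr) / t, math.sqrt(2) / (1 - t)) * scale

def measurement_bound(k, N, snr, mar, c1, gamma, t):
    """Right-hand side of (11.2.1), assuming gamma >= 2."""
    c2 = 16 / (1 - t) ** 2
    c3 = 800 * c1 ** 2 / t ** 2
    log_n = math.log(N)
    return max(2 * k * log_n,
               c2 * k * log_n / (snr * mar),
               (c3 * k * log_n / mar) ** (gamma / 2))

def best_t(k, N, snr, mar, c1, gamma, grid=999):
    """Grid-search approximation of the t in (0, 1) minimizing the bound."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return min(candidates,
               key=lambda t: measurement_bound(k, N, snr, mar, c1, gamma, t))
```

Since $c_2$ grows as $t \to 1$ while $c_3$ grows as $t \to 0$, the minimizer balances the second and third terms of the maximum whenever those terms dominate $2k \log N$.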

Theorem 11.3. Suppose that the design matrix $\Phi$ satisfies the coherence property
and let the noise vector $e_M$ be distributed as $\mathcal{N}(0, \sigma^2 I)$. Next, let $N \ge 128$ and choose
the threshold

\[ \lambda = \max\left\{ \frac{1}{t}\,10\mu\sqrt{M \cdot \mathrm{SNR}},\ \frac{1}{1-t}\sqrt{2} \right\} \sqrt{2\sigma^2 \log N} \]

for any $t \in (0,1)$. Then the OST algorithm (Algorithm 8) satisfies $\Pr(\hat{S} \ne S) \le 6N^{-1}$
as long as we have that $k \le M/(2\log N)$ and

\[ \mathrm{MAR} > \max\left\{ \frac{c_2 k \log N}{M \cdot \mathrm{SNR}},\ \frac{c_3' k \log N}{\mu^{-2}} \right\}. \tag{11.2.2} \]

Here, $c_2 > 0$ is as defined in Theorem 11.2, $c_3' > 0$ is defined as $c_3' = 800\,t^{-2}$, and the
probability of failure is with respect to the true model $S$ and the noise vector $e_M$.


Note  that  the  proof  of  Theorem  11.3  follows  directly  from  the  proof  of  Theorem  11.2. 
There  are  a  few  important  remarks  that  need  to  be  made  at  this  point  concerning  the 
threshold  proposed  in  Theorem  11.2  and  Theorem  11.3  for  the  OST  algorithm.  First, 
it  is  easy  to  see  that  the  proposed  threshold  is  completely  agnostic  to  the  model 
order  k  and  only  requires  knowledge  of  the  SNR  and  the  noise  variance.  Second, 
extensive  simulations  suggest  that  the  absolute  constant  10  in  the  proposed  threshold 
is  somewhat  conservative  and  can  be  reduced  through  the  use  of  more  sophisticated 
analytical tools.

Algorithm 9 The Sorted One-Step Thresholding (SOST) Algorithm for Model Selection

Input: An $M \times N$ matrix $\Phi$, a vector $f \in \mathbb{C}^M$, and model order $k$
Output: An estimate $\hat{S} \subset \{1, \dots, N\}$ of the true model $S$

$\tilde{\alpha} \leftarrow H_k(\Phi^{\dagger} f)$ {Form signal proxy; $H_k$ keeps the $k$ largest-magnitude entries}
$\hat{S} \leftarrow \mathrm{Supp}(\tilde{\alpha})$

Finally, while estimating the true model order $k$ tends to be harder
than estimating the SNR and the noise variance $\sigma^2$ in the majority of situations, it
might be the case that estimating $k$ is easier in some applications. It is better in
such situations to work with a slight variant of the OST algorithm (see Algorithm 9)
that relies on knowledge of the model order $k$ instead and returns an estimate $\hat{S}$
corresponding to the $k$ largest (in magnitude) entries of $\Phi^{\dagger} f$. We characterize the
performance of this algorithm, which we call the sorted one-step thresholding (SOST)
algorithm, in terms of the following theorem.
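Both thresholding rules amount to one matrix-vector product followed by a selection step. The sketch below is a minimal pure-Python rendering under stated assumptions: the matrix is passed as a list of columns, indices are 0-based (the text uses 1-based indices), and the function names are hypothetical. OST keeps the indices whose proxy magnitude exceeds the threshold, while SOST keeps the $k$ largest.

```python
def signal_proxy(columns, f):
    """Form the proxy Phi^dagger f; columns[i] is the i-th column of Phi."""
    return [sum(c.conjugate() * y for c, y in zip(col, f)) for col in columns]

def ost(columns, f, lam):
    """One-step thresholding: indices whose proxy magnitude exceeds lam."""
    proxy = signal_proxy(columns, f)
    return {i for i, v in enumerate(proxy) if abs(v) > lam}

def sost(columns, f, k):
    """Sorted one-step thresholding: indices of the k largest proxy magnitudes."""
    proxy = signal_proxy(columns, f)
    order = sorted(range(len(proxy)), key=lambda i: abs(proxy[i]), reverse=True)
    return set(order[:k])
```

Neither routine solves an optimization problem; the cost is a single multiplication by $\Phi^{\dagger}$ plus a comparison (OST) or a sort (SOST).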


Theorem 11.4 (Exact Model Selection Using SOST). Suppose that the design matrix
$\Phi$ satisfies the coherence property and let the noise vector $e_M$ be distributed as
$\mathcal{N}(0, \sigma^2 I)$. Next, write $\mu(\Phi)$ as $\mu = c_1 M^{-1/\gamma}$ for some $c_1 > 0$ (which may depend
on $N$) and $\gamma \in \{0\} \cup [2, \infty)$. Then the SOST algorithm (Algorithm 9) satisfies
$\Pr(\hat{S} \ne S) \le 6N^{-1}$ as long as $N \ge 128$ and the number of measurements satisfies

\[ M > \min_{t \in (0,1)} \max\left\{ 2k\log N,\ \frac{c_2 k \log N}{\mathrm{SNR}_{\min}},\ \left(\frac{c_3 k \log N}{\mathrm{MAR}}\right)^{\gamma/2} \right\}
     = \min_{t \in (0,1)} \max\left\{ 2k\log N,\ \frac{c_2 k \log N}{\mathrm{SNR} \cdot \mathrm{MAR}},\ \left(\frac{c_3 k \log N}{\mathrm{MAR}}\right)^{\gamma/2} \right\}. \tag{11.2.3} \]

Here, the quantities $c_2, c_3 > 0$ are as defined in Theorem 11.2, while the probability of
failure is with respect to the true model $S$ and the noise vector $e_M$.


The  final  result  that  we  present  in  this  section  concerns  the  partial  model-selection 
performance  of  OST.  Specifically,  note  that  our  focus  in  this  section  has  so  far  been  on 
specifying  conditions  for  either  the  number  of  measurements  or  the  MAR  of  the  signal 
that  ensure  exact  model  selection.  In  many  real-world  applications,  however,  the 
parameters  of  the  problem  are  fixed  and  it  is  not  always  possible  to  ensure  that  either 
the  number  of  measurements  or  the  MAR  of  the  signal  satisfy  the  aforementioned 
conditions.  A  natural  question  to  ask  then  is  whether  the  OST  algorithm  completely 
fails  in  such  circumstances  or  whether  any  guarantees  can  still  be  provided  for  its 
performance.  We  address  this  aspect  of  the  OST  algorithm  in  the  following  and 
show that, even if the MAR of $\alpha^*$ is very small, OST has the ability to identify the
locations of the nonzero entries of $\alpha^*$ whose energies are greater than both the noise
power and the average signal energy per nonzero entry. In order to make this notion
mathematically precise, we first define the $\ell$-th largest-to-average ratio ($\mathrm{LAR}_\ell$) of $\alpha^*$
as the ratio of the energy in the $\ell$-th largest (in magnitude) nonzero entry of $\alpha^*$ and
the average signal energy per nonzero entry of $\alpha^*$; that is,

\[ \mathrm{LAR}_\ell := \frac{|\alpha^*_{(\ell)}|^2}{\|\alpha^*\|_2^2 / k}, \]

where $\alpha^*_{(\ell)}$ denotes the $\ell$-th largest nonzero entry of $\alpha^*$ (note that $\mathrm{MAR} = \mathrm{LAR}_k$).
We are now ready to specify the partial model-selection performance of the OST
algorithm.
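The largest-to-average ratio, and the integer $L$ that Theorem 11.5 associates with it, can be computed directly. In the sketch below the function names are hypothetical and `bound` stands for whatever right-hand side the caller wants to compare against (in Theorem 11.5 it is the maximum appearing in (11.2.4)).

```python
def lar(alpha, ell):
    """ell-th largest-to-average ratio of alpha* (ell is 1-based)."""
    mags = sorted((abs(a) for a in alpha if a != 0), reverse=True)
    k = len(mags)
    energy = sum(m * m for m in mags)
    return mags[ell - 1] ** 2 / (energy / k)

def largest_L(alpha, bound):
    """Largest L with LAR_L > bound; returns 0 when no entry qualifies."""
    k = sum(1 for a in alpha if a != 0)
    L = 0
    for ell in range(1, k + 1):
        # LAR_ell is nonincreasing in ell, so the last success is the largest L.
        if lar(alpha, ell) > bound:
            L = ell
    return L
```

With nonzero entries 3 and -4, for instance, LAR_1 = 16 / 12.5 = 1.28 and LAR_2 = 0.72, which is the MAR of the same vector.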


Theorem 11.5 (Partial Model Selection Using OST). Suppose that the design matrix
$\Phi$ satisfies the coherence property. Next, let $N \ge 128$ and $e_M$ be distributed
as $\mathcal{N}(0, \sigma^2 I)$. Finally, fix a parameter $t \in (0,1)$ and choose the threshold

\[ \lambda = \max\left\{ \frac{1}{t}\,10\mu\sqrt{M \cdot \mathrm{SNR}},\ \frac{1}{1-t}\sqrt{2} \right\} \sqrt{2\sigma^2 \log N}. \]

Then, under the assumption that $k \le M/(2\log N)$, the OST algorithm (Algorithm 8)
guarantees with probability exceeding $1 - 6N^{-1}$ that $\hat{S} \subset S$ and
$|S - \hat{S}| \le (k - L)$, where $L$ is the largest integer for
which the following inequality holds:

\[ \mathrm{LAR}_L > \max\left\{ \frac{c_2 k \log N}{M \cdot \mathrm{SNR}},\ \frac{c_3' k \log N}{\mu^{-2}} \right\}. \tag{11.2.4} \]

Here, the quantities $c_2, c_3' > 0$ are as defined in Theorem 11.3, while the probability of
failure is with respect to the true model $S$ and the noise vector $e_M$.


11.2.3  LASSO  versus  OST 

Historically,  OST  (and  its  variants)  was  preferred  over  the  LASSO  because  of  its 
low  computational  complexity.  The  results  reported  in  this  chapter,  however,  bring 
forth  another  important  aspect  of  OST  (also  see  [120]):  OST  can  lead  to  successful 
model  selection  even  when  the  LASSO  fails.  Specifically,  model  selection  using  the 
LASSO  is  in  fact  a  byproduct  of  signal  reconstruction,  whereas  the  OST  results  do 
not guarantee signal reconstruction without imposing additional constraints on $\Phi$.

In  Chapter  12  we  will  introduce  the  Reed-Muller  frames  as  examples  of  design  ma¬ 
trices  with  optimal  coherence  parameters.  We  will  then  show  cases  in  which  LASSO 
completely  fails  in  recovering  the  support  of  a  sparse  vector,  whereas  the  OST  algo¬ 
rithm  can  successfully  recover  the  support  of  the  same  sparse  vector.  In  other  words, 
model  selection  is  inherently  an  easier  problem  than  signal  reconstruction. 


11.3  Proofs  of  Main  Results 

In this section, we provide detailed proofs of the main results reported in Section 11.2.
Before proceeding further, however, it is advantageous to develop some notation that
will facilitate our forthcoming analysis. In this regard, recall that the true model $S$ is
taken to be a uniformly random $k$-subset of $[N] = \{1, \dots, N\}$. We can therefore write
the data vector $\alpha^*$ under this assumption as the product of a random permutation
matrix and a deterministic $k$-sparse vector. Specifically, let $\bar{z} \in \mathbb{C}^N$ be a deterministic
$k$-sparse vector that we write (without loss of generality) as

\[ \bar{z} = [\,\underbrace{z_1\ \cdots\ z_k}_{=:\, z \,\in\, \mathbb{C}^k}\ \ \underbrace{0\ \cdots\ 0}_{(N-k)\ \text{times}}\,]^{T} \tag{11.3.1} \]

and let $P_\pi$ be an $N \times N$ random permutation matrix; in other words,

\[ P_\pi = [\,e_{\pi_1}\ \ e_{\pi_2}\ \ \cdots\ \ e_{\pi_N}\,]^{T} \tag{11.3.2} \]

where $e_j$ denotes the $j$-th column of the canonical basis $I$ and $\bar{\Pi} = (\pi_1, \dots, \pi_N)$ is
a random permutation of $[N]$. Then the assumption that the model $S$ is a random
subset of $[N]$ is equivalent to stating that the data vector $\alpha^*$ can be written as
$\alpha^* = P_\pi \bar{z}$. In other words, the measurement vector $f$ can be expressed as

\[ f = \Phi\alpha^* + e_M = \Phi P_\pi \bar{z} + e_M = \Phi_\Pi z + e_M \tag{11.3.3} \]

where $\Pi = (\pi_1, \dots, \pi_k)$ denotes the first $k$ elements of the random permutation $\bar{\Pi}$, $\Phi_\Pi$
denotes the $M \times k$ submatrix obtained by collecting the columns of $\Phi$ corresponding
to the indices in $\Pi$, and the vector $z \in \mathbb{C}^k$ represents the $k$ nonzero entries of $\alpha^*$.

Proof  of  Theorem  11.2 

The general road map for the proof of Theorem 11.2 is as follows. Below, we first
introduce the notion of the $(k, \varepsilon, \delta)$-statistical orthogonality condition (StOC). We next
establish the relationship between the StOC parameters and the worst-case and average
coherence of $\Phi$ in Lemma 11.8 and Lemma 11.9. We then provide a proof of
Theorem 11.2 by first showing that if $\Phi$ satisfies the StOC then OST recovers $S$ with
high probability and then relating the results of Lemma 11.8 and Lemma 11.9 to the
coherence property.

Definition 11.6 ($(k, \varepsilon, \delta)$-Statistical Orthogonality Condition). Let $\bar{\Pi} = (\pi_1, \dots, \pi_N)$
be a random permutation of $[N]$, and define $\Pi = (\pi_1, \dots, \pi_k)$ and $\Pi^c = (\pi_{k+1}, \dots, \pi_N)$
for any $k \in [N]$. Then the $M \times N$ (normalized) design matrix $\Phi$ is said to satisfy
the $(k, \varepsilon, \delta)$-statistical orthogonality condition if there exist $\varepsilon, \delta \in [0,1)$ such that the
inequalities

\[ \|(\Phi_\Pi^{\dagger}\Phi_\Pi - I)z\|_\infty \le \varepsilon\|z\|_2 \tag{StOC-1} \]

\[ \|\Phi_{\Pi^c}^{\dagger}\Phi_\Pi z\|_\infty \le \varepsilon\|z\|_2 \tag{StOC-2} \]

hold for every fixed $z \in \mathbb{C}^k$ with probability exceeding $1 - \delta$ (with respect to the random
permutation $\bar{\Pi}$).

Remark 11.7. Note that the StOC derives its name from the fact that if $\Phi$ is an
$N \times N$ orthonormal matrix then it trivially satisfies the StOC for every $k \in [N]$ with
$\varepsilon = \delta = 0$. In addition, although we will not use this fact explicitly in the chapter,
it can be checked that if $\Phi$ satisfies $(k, \varepsilon, \delta)$-StOC then it approximately preserves the
$\ell_2$-norms of $k$-sparse signals with probability exceeding $1 - \delta$ as long as $k < \varepsilon^{-2}$.
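Because the StOC is a statement about a random permutation, it lends itself to a simple Monte Carlo sanity check. The sketch below is an illustration only, not part of the analysis: it assumes unit-norm columns stored as a list, uses 0-based indices, and estimates how often (StOC-1) or (StOC-2) fails for one fixed $z$ and a given $\varepsilon$. The function names, the number of trials, and the seed are arbitrary choices.

```python
import random

def inner(u, v):
    """Inner product <u, v> = u^dagger v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

def stoc_holds(columns, perm, z, eps):
    """Check (StOC-1) and (StOC-2) for one permutation and one fixed z."""
    k = len(z)
    z_norm = sum(abs(v) ** 2 for v in z) ** 0.5
    head, tail = perm[:k], perm[k:]
    # (StOC-1): entries of (Phi_Pi^dagger Phi_Pi - I) z, using unit-norm columns.
    worst1 = max(
        abs(sum(z[j] * inner(columns[head[i]], columns[head[j]])
                for j in range(k) if j != i))
        for i in range(k))
    # (StOC-2): entries of Phi_{Pi^c}^dagger Phi_Pi z.
    worst2 = max(
        (abs(sum(z[j] * inner(columns[q], columns[head[j]]) for j in range(k)))
         for q in tail),
        default=0.0)
    return worst1 <= eps * z_norm and worst2 <= eps * z_norm

def stoc_failure_rate(columns, z, eps, trials=200, seed=0):
    """Empirical fraction of random permutations for which the StOC fails."""
    rng = random.Random(seed)
    n = len(columns)
    failures = 0
    for _ in range(trials):
        perm = rng.sample(range(n), n)   # uniformly random permutation
        if not stoc_holds(columns, perm, z, eps):
            failures += 1
    return failures / trials
```

For an orthonormal matrix the estimated failure rate is zero even with $\varepsilon = 0$, in line with Remark 11.7.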

Having defined StOC, our goal in the next two lemmas is to relate the StOC parameters
$k$, $\varepsilon$, and $\delta$ to the worst-case and average coherence of the design matrix $\Phi$.
Lemma 11.8. Let $\Pi = (\pi_1, \dots, \pi_k)$ denote the first $k$ elements of a random
permutation of $[N]$ and choose a parameter $a \ge 1$. Then, for any $\varepsilon \in [0,1)$,
$k \le \min\{\varepsilon^2\nu^{-2},\ (1+a)^{-1}N\}$, and fixed $z \in \mathbb{C}^k$, we have

\[ \Pr\big(\{\Phi \text{ does not satisfy (StOC-1)}\}\big) \le 4k \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{16(2 + a^{-1})^2 \mu^2} \right). \]


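Proof. The proof of this lemma relies heavily on the so-called method of bounded
differences (MOBD) [190]. Specifically, we begin by noting that

\[ \|(\Phi_\Pi^{\dagger}\Phi_\Pi - I)z\|_\infty = \max_{i \in [k]} \Big| \sum_{j \ne i} z_j \langle \varphi_{\pi_i}, \varphi_{\pi_j} \rangle \Big|. \]

Therefore, for a fixed index $i$, and conditioned on the event
$A_{i'} := \{\pi_i = i'\}$, we have the following equality from basic probability theory:

\[ \Pr\Big( \Big| \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{\pi_i}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big)
   = \Pr\Big( \Big| \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big). \tag{11.3.4} \]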


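Next, in order to apply the MOBD to obtain an upper bound for (11.3.4), we first
define a random $(k-1)$-tuple $\Pi^{-i} := (\pi_1, \dots, \pi_{i-1}, \pi_{i+1}, \dots, \pi_k)$ and then construct a
Doob martingale $(Z_0, Z_1, \dots, Z_{k-1})$ as follows:

\[ Z_0 = \mathbb{E}\Big[ \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\Big|\, A_{i'} \Big], \quad\text{and}\quad
   Z_\ell = \mathbb{E}\Big[ \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\Big|\, \pi^{-i}_{1 \to \ell}, A_{i'} \Big], \quad \ell = 1, \dots, k-1 \tag{11.3.5} \]

where $\pi^{-i}_{1 \to \ell}$ denotes the first $\ell$ elements of $\Pi^{-i}$. The first thing to note here is that
we have from the linearity of (conditional) expectation

\[ |Z_0| = \Big| \sum_{j \ne i} z_j\, \mathbb{E}\big[ \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\big|\, A_{i'} \big] \Big|
   \overset{(a)}{\le} \sum_{j \ne i} |z_j| \, \Big| \sum_{q=1,\, q \ne i'}^{N} \frac{1}{N-1} \langle \varphi_{i'}, \varphi_q \rangle \Big|
   \overset{(b)}{\le} \nu \|z\|_1 \le \sqrt{k}\,\nu\,\|z\|_2 \]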

where (a) follows from the fact that, conditioned on $A_{i'}$, $\pi_j$ has a uniform distribution
over $[N] - \{i'\}$, while (b) is mainly a consequence of the definition of average coherence.
In addition, if we use $\pi^{-i}_\ell$ to denote the $\ell$-th element of $\Pi^{-i}$ and define

\[ Z_\ell(r) := \mathbb{E}\Big[ \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\Big|\, \pi^{-i}_{1 \to \ell-1},\ \pi^{-i}_\ell = r,\ A_{i'} \Big] \tag{11.3.6} \]

for $\ell = 1, \dots, k-1$ then, since $(Z_0, Z_1, \dots, Z_{k-1})$ is a Doob martingale, it can be
easily verified that $|Z_\ell - Z_{\ell-1}|$ is upper bounded by $\sup_{r,s} |Z_\ell(r) - Z_\ell(s)|$ (see, e.g.,
[195]).

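Now, in order to upper bound $\sup_{r,s} |Z_\ell(r) - Z_\ell(s)|$, notice that we can bound
$|Z_\ell(r) - Z_\ell(s)|$ as

\[ \begin{aligned}
|Z_\ell(r) - Z_\ell(s)| &= \Big| \sum_{j \ne i} z_j \underbrace{\Big( \mathbb{E}\big[ \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\big|\, \pi^{-i}_{1 \to \ell-1}, \pi^{-i}_\ell = r, A_{i'} \big]
   - \mathbb{E}\big[ \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\big|\, \pi^{-i}_{1 \to \ell-1}, \pi^{-i}_\ell = s, A_{i'} \big] \Big)}_{=:\, d_{\ell,j}} \Big| \\
&\le \sum_{j \le \ell+1,\ j \ne i} |z_j|\,|d_{\ell,j}| + \sum_{j > \ell+1,\ j \ne i} |z_j|\,|d_{\ell,j}|.
\end{aligned} \tag{11.3.7} \]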

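In addition, we have that for every $j > \ell+1$, $j \ne i$, the random variable $\pi_j$ has
a uniform distribution over $[N] - \{\pi^{-i}_{1 \to \ell-1}, r, i'\}$ when conditioned on
$\{\pi^{-i}_{1 \to \ell-1}, \pi^{-i}_\ell = r, A_{i'}\}$, whereas $\pi_j$ has a uniform distribution over
$[N] - \{\pi^{-i}_{1 \to \ell-1}, s, i'\}$ when conditioned on $\{\pi^{-i}_{1 \to \ell-1}, \pi^{-i}_\ell = s, A_{i'}\}$.
Therefore, we get $\forall\, j > \ell+1$, $j \ne i$,

\[ |d_{\ell,j}| = \frac{1}{N-\ell-1} \big| \langle \varphi_{i'}, \varphi_r \rangle - \langle \varphi_{i'}, \varphi_s \rangle \big| \le \frac{2\mu}{N-\ell-1}. \tag{11.3.8} \]

Similarly, it can be shown that

\[ \sum_{j \le \ell+1,\ j \ne i} |z_j|\,|d_{\ell,j}| \le |z_{\ell+1}|\,2\mu \ \text{ when } i \le \ell, \qquad
   \sum_{j \le \ell+1,\ j \ne i} |z_j|\,|d_{\ell,j}| \le |z_\ell|\,2\mu \ \text{ when } i = \ell+1, \]

and $\sum_{j \le \ell+1,\ j \ne i} |z_j|\,|d_{\ell,j}| \le \big( |z_\ell| + \frac{|z_{\ell+1}|}{N-\ell-1} \big) 2\mu$ when $i > \ell+1$. Consequently, regardless of
the initial choice of $i$, we obtain

\[ \sup_{r,s} |Z_\ell(r) - Z_\ell(s)| \le 2\mu \underbrace{\Big( |z_\ell| + |z_{\ell+1}| + \frac{1}{N-\ell-1} \sum_{j > \ell+1} |z_j| \Big)}_{=:\, d_\ell}. \tag{11.3.9} \]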


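We have now established that $(Z_0, Z_1, \dots, Z_{k-1})$ is a (real- or complex-valued)
bounded-difference martingale sequence with $|Z_\ell - Z_{\ell-1}| \le 2\mu d_\ell$ for $\ell = 1, \dots, k-1$. Therefore,
under the assumption that $k \le \varepsilon^2\nu^{-2}$ and since it has been established in (11.3.5)
that $|Z_0| \le \sqrt{k}\,\nu\,\|z\|_2$, it is easy to see that

\[ \begin{aligned}
\Pr\Big( \Big| \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big)
 &\le \Pr\Big( |Z_{k-1} - Z_0| > \varepsilon\|z\|_2 - \sqrt{k}\,\nu\,\|z\|_2 \,\Big|\, A_{i'} \Big) \\
 &\overset{(c)}{\le} 4 \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2 \|z\|_2^2}{16\mu^2 \sum_{\ell=1}^{k-1} d_\ell^2} \right)
\end{aligned} \tag{11.3.10} \]

where (c) follows from the complex Azuma inequality for bounded-difference martingale
sequences (see Theorem 2.16 in Chapter 2). Further, it can be established
through routine calculations from (11.3.9) that $\sum_{\ell=1}^{k-1} d_\ell^2 \le (2 + a^{-1})^2 \|z\|_2^2$ since
$k \le N/(1+a)$. Combining all these facts together, we finally obtain that

\[ \begin{aligned}
\Pr\big( \|(\Phi_\Pi^{\dagger}\Phi_\Pi - I)z\|_\infty > \varepsilon\|z\|_2 \big)
 &\overset{(d)}{\le} k \Pr\Big( \Big| \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{\pi_i}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \Big) \\
 &= k \sum_{i'=1}^{N} \Pr\Big( \Big| \sum_{j=1,\, j \ne i}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big) \Pr(A_{i'}) \\
 &\overset{(e)}{\le} 4k \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{16(2 + a^{-1})^2 \mu^2} \right)
\end{aligned} \]

where (d) follows from the union bound and the fact that the $\pi_i$'s are identically
(though not independently) distributed, while (e) follows from (11.3.10) and the fact
that $\pi_i$ has a uniform distribution over $[N]$. □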

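Lemma 11.9. Let $\Pi = (\pi_1, \dots, \pi_k)$ and $\Pi^c = (\pi_{k+1}, \dots, \pi_N)$ denote the first $k$ and
the last $(N-k)$ elements of a random permutation of $[N]$, respectively, and choose
a parameter $a \ge 1$. Then, for any $\varepsilon \in [0,1)$, $k \le \min\{\varepsilon^2\nu^{-2},\ (1+a)^{-1}N\}$, and fixed
$z \in \mathbb{C}^k$, we have

\[ \Pr\big(\{\Phi \text{ does not satisfy (StOC-2)}\}\big) \le 4(N-k) \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{8(1 + a^{-1})^2 \mu^2} \right). \]

Proof. The proof of this lemma is very similar to that of Lemma 11.8 and also relies on
the MOBD. To begin with, we note that

\[ \|\Phi_{\Pi^c}^{\dagger}\Phi_\Pi z\|_\infty = \max_{i \in [N-k]} \Big| \sum_{j=1}^{k} z_j \langle \varphi_{\pi^c_i}, \varphi_{\pi_j} \rangle \Big|, \]

where $[N-k] = \{1, \dots, N-k\}$ and $\pi^c_i$ denotes the $i$-th element of $\Pi^c$. Then for a
fixed index $i \in [N-k]$, and conditioned on the event $A_{i'} := \{\pi^c_i = i'\}$, we again have
the following simple equality

\[ \Pr\Big( \Big| \sum_{j=1}^{k} z_j \langle \varphi_{\pi^c_i}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big)
   = \Pr\Big( \Big| \sum_{j=1}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big). \tag{11.3.11} \]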

Next, as in the case of Lemma 11.8, we construct a Doob martingale sequence
$(Z_0, Z_1, \dots, Z_k)$ as follows:

\[ Z_0 = \mathbb{E}\Big[ \sum_{j=1}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\Big|\, A_{i'} \Big], \quad\text{and}\quad
   Z_\ell = \mathbb{E}\Big[ \sum_{j=1}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\Big|\, \pi_{1 \to \ell}, A_{i'} \Big], \quad \ell = 1, \dots, k \]

where $\pi_{1 \to \ell}$ now denotes the first $\ell$ elements of $\Pi$. Then, since $\pi_j$ has a uniform
distribution over $[N] - \{i'\}$ when conditioned on $A_{i'}$, we once again have the bound
$|Z_0| \le \sqrt{k}\,\nu\,\|z\|_2$. Therefore, the only remaining thing that we need to show in order
to be able to apply the complex Azuma inequality to the constructed martingale
$(Z_0, Z_1, \dots, Z_k)$ is that $|Z_\ell - Z_{\ell-1}|$ is suitably bounded.

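In this regard, we make use of the notation

\[ Z_\ell(r) := \mathbb{E}\Big[ \sum_{j=1}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\Big|\, \pi_{1 \to \ell-1},\ \pi_\ell = r,\ A_{i'} \Big] \]

and note that $|Z_\ell(r) - Z_\ell(s)|$ can be bounded as

\[ \begin{aligned}
|Z_\ell(r) - Z_\ell(s)| &= \Big| \sum_{j} z_j \Big( \mathbb{E}\big[ \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\big|\, \pi_{1 \to \ell-1}, \pi_\ell = r, A_{i'} \big]
   - \mathbb{E}\big[ \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \,\big|\, \pi_{1 \to \ell-1}, \pi_\ell = s, A_{i'} \big] \Big) \Big| \\
&\le |z_\ell|\,\big| \langle \varphi_{i'}, \varphi_r \rangle - \langle \varphi_{i'}, \varphi_s \rangle \big|
   + \sum_{j > \ell} |z_j|\, \frac{\big| \langle \varphi_{i'}, \varphi_r \rangle - \langle \varphi_{i'}, \varphi_s \rangle \big|}{N-\ell-1} \\
&\le 2\mu \underbrace{\Big( |z_\ell| + \frac{1}{N-\ell-1} \sum_{j > \ell} |z_j| \Big)}_{=:\, d_\ell}
\end{aligned} \tag{11.3.12} \]

which implies that $\sup_{r,s} |Z_\ell(r) - Z_\ell(s)| \le 2\mu d_\ell$, $\ell = 1, \dots, k$. Consequently, we
have now established that $(Z_0, Z_1, \dots, Z_k)$ is a bounded-difference martingale with
$|Z_\ell - Z_{\ell-1}| \le 2\mu d_\ell$. Therefore, since $k \le \varepsilon^2\nu^{-2}$ and $|Z_0| \le \sqrt{k}\,\nu\,\|z\|_2$, we once again
have from the complex Azuma inequality that

\[ \begin{aligned}
\Pr\Big( \Big| \sum_{j=1}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big)
 &\le \Pr\Big( |Z_k - Z_0| > \varepsilon\|z\|_2 - \sqrt{k}\,\nu\,\|z\|_2 \,\Big|\, A_{i'} \Big) \\
 &\overset{(a)}{\le} 4 \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{8(1 + a^{-1})^2 \mu^2} \right)
\end{aligned} \tag{11.3.13} \]

where (a) follows by noting that $\sum_{\ell=1}^{k} d_\ell^2 \le (1 + a^{-1})^2 \|z\|_2^2$ since $k \le N/(1+a)$.
Combining all these facts together, we finally obtain the claimed result as follows

\[ \begin{aligned}
\Pr\big( \|\Phi_{\Pi^c}^{\dagger}\Phi_\Pi z\|_\infty > \varepsilon\|z\|_2 \big)
 &\overset{(b)}{\le} (N-k) \Pr\Big( \Big| \sum_{j=1}^{k} z_j \langle \varphi_{\pi^c_i}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \Big) \\
 &= (N-k) \sum_{i'=1}^{N} \Pr\Big( \Big| \sum_{j=1}^{k} z_j \langle \varphi_{i'}, \varphi_{\pi_j} \rangle \Big| > \varepsilon\|z\|_2 \,\Big|\, A_{i'} \Big) \Pr(A_{i'}) \\
 &\overset{(c)}{\le} 4(N-k) \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{8(1 + a^{-1})^2 \mu^2} \right)
\end{aligned} \tag{11.3.14} \]

where (b) follows from the union bound and the fact that the $\pi^c_i$'s are identically
(though not independently) distributed, while (c) follows from (11.3.13) and the fact
that $\pi^c_i$ has a uniform distribution over $[N]$. □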


Note that Lemma 11.8 and Lemma 11.9 collectively prove through a simple union
bound argument that an $M \times N$ design matrix $\Phi$ satisfies $(k, \varepsilon, \delta)$-StOC for any
$\varepsilon \in [0,1)$ with $\delta \le 4N \exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{16(2 + a^{-1})^2 \mu^2} \right)$ for any $a \ge 1$ as long as we have that
$k \le \min\{\varepsilon^2\nu^{-2},\ (1+a)^{-1}N\}$. We are now ready to provide a proof of Theorem 11.2.
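To get a feel for the size of this bound, it can be evaluated numerically. The following sketch is illustrative; the function name is hypothetical and the parameter checks simply mirror the assumptions of the two lemmas.

```python
import math

def stoc_delta_bound(N, k, eps, mu, nu, a):
    """Union bound on delta: 4 N exp(-(eps - sqrt(k) nu)^2 / (16 (2 + 1/a)^2 mu^2)).

    Raises ValueError when the assumptions of Lemmas 11.8 and 11.9
    (a >= 1, k <= eps^2 nu^-2, k <= N / (1 + a)) are violated.
    """
    if a < 1 or math.sqrt(k) * nu > eps or k > N / (1 + a):
        raise ValueError("parameters violate the assumptions of the lemmas")
    gap = eps - math.sqrt(k) * nu
    return 4 * N * math.exp(-gap ** 2 / (16 * (2 + 1 / a) ** 2 * mu ** 2))
```

As an example, take $N = 1024$, $M = 256$, $k = 18 \le M/(2\log N)$, $\mu = 0.02$, $\nu = \mu/\sqrt{M}$, $\varepsilon = 10\mu\sqrt{2\log N}$, and $a = 2\log N - 1$; the bound then falls well below $4N^{-1}$.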


Proof  of  Theorem  11.2 

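We begin by making use of the notation developed at the start of this section and
writing the signal proxy $\tilde{\alpha} = \Phi^{\dagger} f$ as $\tilde{\alpha} = \Phi^{\dagger}\Phi_\Pi z + \Phi^{\dagger} e_M$. Now, let $\Pi^c = (\pi_{k+1}, \dots, \pi_N)$
denote the last $(N-k)$ elements of $\bar{\Pi}$ and note that we need to show that $\|\tilde{\alpha}_{\Pi^c}\|_\infty \le \lambda$
and $\min_{i \in \Pi} |\tilde{\alpha}_i| > \lambda$ in order to establish that $\hat{S} = S$.

In this regard, we first assume that $\Phi$ satisfies $(k, \varepsilon, \delta)$-StOC and define

\[ \lambda_\varepsilon := \max\left\{ \frac{1}{t}\,\varepsilon\|z\|_2,\ \frac{1}{1-t}\,2\sqrt{\sigma^2 \log N} \right\} \]

for any $t \in (0,1)$. Next, it can be verified through Theorem 2.13 in Chapter 2
that $\tilde{e}_M = \Phi^{\dagger} e_M$ satisfies $\|\tilde{e}_M\|_\infty \le 2\sqrt{\sigma^2 \log N}$ with probability exceeding
$1 - 2(\sqrt{2\pi \log N} \cdot N)^{-1}$. Now define the probability event

\[ \mathcal{G} := \big\{ \Phi \text{ satisfies (StOC-1) and (StOC-2)} \big\} \cap \big\{ \|\tilde{e}_M\|_\infty \le 2\sqrt{\sigma^2 \log N} \big\} \tag{11.3.15} \]

and notice that we have $\Pr(\mathcal{G}) > 1 - \delta - 2(\sqrt{2\pi \log N} \cdot N)^{-1}$. Further, conditioned
on the event $\mathcal{G}$, we have

\[ \|\tilde{\alpha}_{\Pi^c}\|_\infty \overset{(a)}{\le} \|\Phi_{\Pi^c}^{\dagger}\Phi_\Pi z\|_\infty + \|\Phi_{\Pi^c}^{\dagger} e_M\|_\infty
   \overset{(b)}{\le} \varepsilon\|z\|_2 + 2\sqrt{\sigma^2 \log N} \overset{(c)}{\le} \lambda_\varepsilon \tag{11.3.16} \]

where (a) follows from the triangle inequality, (b) is mainly a consequence of the
conditioning on the event $\mathcal{G}$, and (c) follows from the definition of $\lambda_\varepsilon$. Next, we define
$r := (\Phi_\Pi^{\dagger}\Phi_\Pi - I)z$ and notice that, conditioned on the event $\mathcal{G}$, we have for any $i \in [k]$
the following inequality:

\[ |\tilde{\alpha}_{\pi_i}| = |z_i + r_i + \tilde{e}_{\pi_i}| \ge |z_i| - \|r\|_\infty - \|\tilde{e}_M\|_\infty
   \overset{(d)}{\ge} \|\alpha\|_{\min} - \varepsilon\|z\|_2 - 2\sqrt{\sigma^2 \log N} \overset{(e)}{\ge} \|\alpha\|_{\min} - \lambda_\varepsilon. \tag{11.3.17} \]

Here, (d) follows from the conditioning on $\mathcal{G}$, while (e) is a simple consequence of
the choice of $\lambda_\varepsilon$. It can therefore be concluded from (11.3.16) and (11.3.17) that if
$\Phi$ satisfies $(k, \varepsilon, \delta)$-StOC and the OST algorithm uses the threshold $\lambda_\varepsilon$ then we have
$\Pr(\hat{S} \ne S) \le \Pr(\mathcal{G}^c)$ as long as $\|\alpha\|_{\min} > 2\lambda_\varepsilon$.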

Finally, to complete the proof of this theorem, we let $k \le M/(2\log N)$ and fix
$\varepsilon = 10\mu\sqrt{2\log N}$. Then the claim is that $\Phi$ satisfies $(k, \varepsilon, \delta)$-StOC with $\delta \le 4N^{-1}$. In
order to establish this claim, we only need to ensure that the chosen parameters satisfy
the assumptions of Lemma 11.8 and Lemma 11.9. In this regard, note that (i) $\varepsilon < 1$
because of (CP-1), and (ii) $\sqrt{k}\,\nu \le \mu/\sqrt{2\log N} = \varepsilon/(20\log N)$ because of the assumption
that $k \le M/(2\log N)$ and (CP-2). Therefore, since the assumption $N \ge 128$ together with
$k \le M/(2\log N)$ implies that $16(2 + a^{-1})^2 \le 72$, we obtain
$\exp\left( -\frac{(\varepsilon - \sqrt{k}\,\nu)^2}{16(2 + a^{-1})^2 \mu^2} \right) \le N^{-2}$. We can
now combine this fact with the previously established facts to see that the threshold
$\lambda = \max\left\{ \frac{1}{t}\,10\mu\sqrt{M \cdot \mathrm{SNR}},\ \frac{1}{1-t}\sqrt{2} \right\} \sqrt{2\sigma^2 \log N}$ guarantees that $\Pr(\hat{S} \ne S) \le 6N^{-1}$
as long as $M \ge 2k\log N$ and $\|\alpha\|_{\min} > 2\lambda$. Finally, note that

\[ \|\alpha\|_{\min} > \frac{4}{1-t}\sqrt{\sigma^2 \log N} \iff M > \frac{c_2 k \log N}{\mathrm{SNR}_{\min}} \]

and

\[ \|\alpha\|_{\min} > \frac{20}{t}\,\mu\sqrt{2M\sigma^2 \log N \cdot \mathrm{SNR}} \iff M > \left( \frac{c_3 k \log N}{\mathrm{MAR}} \right)^{\gamma/2}. \]


This completes the proof of the theorem.

Proof of Theorem 11.5

We begin by making use of the notation developed earlier in this section and
conditioning on the event $\mathcal{G}$ defined in (11.3.15) with $\varepsilon = 10\mu\sqrt{2\log N}$. Then it is easy to
see from the proof of Theorem 11.2 that the estimate $\hat{S}$ is a subset of $S$ because of
the fact that $\|\tilde{\alpha}_{\Pi^c}\|_\infty \le \lambda$.

Next, assume without loss of generality that $z_i = \alpha^*_{(i)}$ and note from (11.3.17) that
$|\tilde{\alpha}_{\pi_i}| \ge |\alpha^*_{(i)}| - \lambda$ for any $i \in \{1, \dots, k\}$. Then, since $\pi_i \in \hat{S}$ if and only if $|\tilde{\alpha}_{\pi_i}| > \lambda$,
we have that $|\alpha^*_{(i)}| > 2\lambda \implies \pi_i \in \hat{S}$. Now define $L$ to be the largest integer for which
$|\alpha^*_{(L)}| > 2\lambda$ holds and note that $|\alpha^*_{(L)}| > 2\lambda \implies |\alpha^*_{(i)}| > 2\lambda \implies \pi_i \in \hat{S}$ for every
$i \in \{1, \dots, L\}$, which in turn implies $|S - \hat{S}| \le (k - L)$. Finally, note that

\[ |\alpha^*_{(L)}| > \frac{4}{1-t}\sqrt{\sigma^2 \log N} \iff \mathrm{LAR}_L > \frac{c_2 k \log N}{M \cdot \mathrm{SNR}} \]

and

\[ |\alpha^*_{(L)}| > \frac{20}{t}\,\mu\sqrt{2M\sigma^2 \log N \cdot \mathrm{SNR}} \iff \mathrm{LAR}_L > \frac{c_3' k \log N}{\mu^{-2}}. \]

This completes the proof of the theorem since the event $\mathcal{G}$ holds with probability
exceeding $1 - 6N^{-1}$.


11.4  Near-Optimal  Design  Matrices  for  One-Step 
Thresholding:  Some  Examples 

Section 11.2 establishes that design matrices with small worst-case coherence (and
consequently small average coherence) are particularly well-suited for model selection
and recovery of sparse signals using OST. Moreover, in the next chapter we will
see the implications of the spectral norm on the uniqueness of sparse representation.
Further, since the Welch bound [251] dictates that $\mu \gtrsim M^{-1/2}$ for $N \gg 1$ and since we
have from elementary linear algebra that $\|\Phi\|_2 \ge \sqrt{N/M}$ (unit-norm columns give
$\|\Phi\|_F^2 = N$, while $\|\Phi\|_F^2 \le M\|\Phi\|_2^2$ because $\Phi$ has at most $M$ nonzero singular values),
we are particularly interested in design matrices that approximately satisfy the scaling
relations $\mu(\Phi) \asymp M^{-1/2}$, $\nu(\Phi) \lesssim M^{-1}$, and $\|\Phi\|_2 \asymp \sqrt{N/M}$. In the following, we provide
some examples of both random and deterministic design matrices that are nearly optimal
in terms of these requisite conditions (see also Table 11.1 for an overview of the results
reported here).

11.4.1 Random Design Matrices


Random matrices are perhaps the most well-studied design matrices in the literature
on high-dimensional, linear inference problems. This is in part due to the fact that
geometric concepts such as the irrepresentable condition [261] and the restricted isometry
property (RIP) [49] have, to date, been shown to hold near-optimally only for the
case of random matrices. The following two lemmas make precise the intuition that
traditional random design matrices such as Gaussian matrices and (random) partial
Fourier matrices also tend to be near-optimal in terms of the geometric measures of
$\mu$, $\nu$, and/or $\|\Phi\|_2$.

Lemma 11.10 (Geometry of Gaussian Matrices). Let $\Phi$ be an $M \times N$ design matrix
with independent and identically distributed (i.i.d.) $\mathcal{N}(0, 1/M)$ entries and let $M \ge$

60  log IV.  Then,  we  have  that  $  satisfies  (i)  //(<&)  <  ,  (H)  ^(4?)  <  v/15J^gJV, 

and (iii) $\|\Phi\|_2 \le 1 + 2\sqrt{N/M}$ with probability exceeding $1 - 2(N^{-1} + N^{-2} + e^{-N/2})$.$^{2}$

Note  that  the  worst-case  coherence  bound  in  this  lemma  follows  from  bounds  on  the 
inner  product  of  independent  Gaussian  vectors  (see,  e.g.,  [17,  Appendix  A])  and  a 
simple  union  bound  argument,  the  proof  of  the  average  coherence  bound  is  provided 
in  [17,  Lemma  2],  and  the  spectral  norm  bound  follows  from  [219,  (2.3)].  It  is  worth 
pointing  out  here  that  similar  results  can  also  be  obtained  for  sub-Gaussian  design 
matrices using standard concentration inequalities and [219, Proposition 2.4].

Lemma 11.11 (Geometry of Partial Fourier Matrices). Let $U$ be an $N$-point
(non-normalized) discrete Fourier transform matrix such that $U^{\dagger}U = NI$. Next, populate
an index set $\Omega$ by sampling $M$ times with replacement from the set $\{1, \dots, N\}$ and construct $\Phi$
by collecting the rows of $U$ corresponding to the indices in $\Omega$ and normalizing the
resulting matrix by $1/\sqrt{M}$.
resulting  matrix  by  1  j\[M.  Then  4>  satisfies  (i)  /i(4>)  <  /AF  «»<*  (»)  "(*)  < 
max  jy/iy,  yfpvyy  |  with  probability  exceeding  1  —  2N~X . 

In this lemma, the worst-case coherence bound follows by noting that the columns of
$U$ form a group under pointwise multiplication and then making use of Hoeffding's
inequality [38]. On the other hand, the average coherence expression in it follows
from the definition of the average coherence and the fact that the all-ones vector
$\mathbf{1}$ is in the null space of any partial Fourier matrix that does not include the first row
of $U$. Finally, note that sampling in Lemma 11.11 is carried out with replacement, which
makes it difficult to specify the spectral norm of $\Phi$. In practice, however, one would
not construct partial Fourier matrices with identical rows, and the spectral norm of
partial Fourier matrices in such cases would be $\sqrt{N/M}$ for the simple reason that the
rows of $U$ are mutually orthogonal.
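The last two observations are easy to confirm numerically for a partial Fourier matrix built from distinct rows. The sketch below is only an illustration under stated assumptions: rows are indexed from 0 (so the "first row" of $U$ is row 0), the DFT sign convention $e^{-j2\pi rc/N}$ is one common choice, and the function names are hypothetical. It forms $\Phi$ and its row Gram matrix $\Phi\Phi^{\dagger}$, which equals $(N/M)I$ when the rows are distinct, so every nonzero singular value of $\Phi$ is $\sqrt{N/M}$.

```python
import cmath

def partial_fourier(N, rows):
    """Rows of the N-point DFT matrix indexed by `rows`, scaled by 1/sqrt(M)."""
    M = len(rows)
    scale = 1 / M ** 0.5
    return [[scale * cmath.exp(-2j * cmath.pi * r * c / N) for c in range(N)]
            for r in rows]

def row_gram(phi):
    """Phi Phi^dagger, the M x M Gram matrix of the rows of Phi."""
    return [[sum(a * b.conjugate() for a, b in zip(ri, rj)) for rj in phi]
            for ri in phi]

def times_all_ones(phi):
    """Phi applied to the all-ones vector, i.e. the row sums of Phi."""
    return [sum(row) for row in phi]
```

For instance, with $N = 8$ and rows $\{1, 3, 6\}$, the Gram matrix is $(8/3)I$ and every row sums to zero, so the all-ones vector lies in the null space as stated above.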

$^{2}$ Note that the results (and the definition of the coherence property) presented earlier remain
valid if $\mu(\Phi)$ is replaced with an upper bound on it.

Table 11.1: Comparisons between different classes of random and deterministic design
matrices. All bounds ignore the $O(\cdot)$ constants.

Matrix                     | $N$       | $\mu(\Phi)$         | $\nu(\Phi)$            | $\|\Phi\|_2$   | Randomness | Complexity
Gaussian Matrices          | -         | $\sqrt{\log N / M}$ | $\sqrt{\log N} / M$    | $\sqrt{N/M}$   | $MN$       | $MN$
Partial Fourier Matrices   | -         | $\sqrt{\log N / M}$ | $\frac{N-M}{M(N-1)}$   | -              | $M \log N$ | $N \log N$
Alltop Gabor Frames        | $M^2$     | $1/\sqrt{M}$        | $1/M$                  | $\sqrt{N/M}$   | -          | $N \log N$
Discrete-Chirp Matrices    | $M^2$     | $1/\sqrt{M}$        | $\frac{N-M}{M(N-1)}$   | $\sqrt{N/M}$   | -          | $N \log N$
Dual BCH Sensing Matrices  | $M^2$     | $\sqrt{2/M}$        | $\frac{N-M}{M(N-1)}$   | $\sqrt{N/M}$   | -          | $N \log N$
Delsarte-Goethals Frames   | $M^{2+r}$ | $2^r/\sqrt{M}$      | $\frac{1}{N-1}$        | $\sqrt{N/M}$   | -          | $N \log N$

11.4.2  Deterministic  Design  Matrices 

Having  described  the  geometry  of  Gaussian  matrices  and  partial  Fourier  matrices,  we 
now  show  that  there  in  fact  exist  many  classes  of  deterministic  design  matrices  that 
are  quite  similar  to  these  random  design  matrices  in  terms  of  the  geometric  measures 
of $\mu$, $\nu$, and $\|\Phi\|_2$. This is in stark contrast to the best known results for the RIP
of  deterministic  matrices  and  has  important  implications  from  an  implementation 
viewpoint  since  multiplications  with  the  deterministic  matrices  described  below  (and 
their  adjoints)  can  be  efficiently  carried  out  using  algorithms  such  as  the  fast  Fourier 
transform  (FFT)  and  the  fast  Hadamard  transform  (FHT). 

Geometry  of  Gabor  Frames  and  Its  Implications 

A (finite) frame for $\mathbb{C}^M$ is defined as any collection of $N \ge M$ vectors that span
the $M$-dimensional Hilbert space $\mathbb{C}^M$. Gabor frames for $\mathbb{C}^M$ constitute an important
class of frames, having applications in areas such as communications [18] and radar
[146], that are constructed from time- and frequency-shifts of a nonzero seed vector
in $\mathbb{C}^M$. Specifically, let $g \in \mathbb{C}^M$ be a unit-norm seed vector and define $T$ to be an
$M \times M$ time-shift matrix that is generated from $g$ as follows

\[
T(g) = \begin{bmatrix}
g_1 & g_M & \cdots & g_2 \\
g_2 & g_1 & \cdots & g_3 \\
\vdots & \vdots & \ddots & \vdots \\
g_M & g_{M-1} & \cdots & g_1
\end{bmatrix} \tag{11.4.1}
\]

where we write $T = T(g)$ to emphasize that $T$ is a matrix-valued function on
$\mathbb{C}^M$. Next, denote the collection of $M$ samples of a discrete sinusoid with frequency
$2\pi\frac{m}{M}$, $m \in \{0, \dots, M-1\}$, as $\omega_m = [e^{j2\pi\frac{m}{M}0} \ \dots \ e^{j2\pi\frac{m}{M}(M-1)}]^T$. Finally, define the
corresponding $M \times M$ diagonal modulation matrices as $W_m = \mathrm{diag}(\omega_m)$. Then the
Gabor frame generated from $g$ is an $M \times M^2$ block matrix of the form

\[
\Phi = [W_0 T \ \ W_1 T \ \ \dots \ \ W_{M-1} T]. \tag{11.4.2}
\]

In words, columns of the Gabor frame $\Phi$ are given by downward circular shifts and
modulations  (frequency  shifts)  of  the  seed  vector  g.  We  are  now  ready  to  state  the 
first  main  result  concerning  the  geometry  of  Gabor  frames,  which  follows  directly 
from  [178]. 
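
The construction in (11.4.1) and (11.4.2) can be sketched numerically. The snippet below is a minimal illustration, assuming NumPy and a randomly drawn unit-norm seed; the function name `gabor_frame` and the choice $M = 7$ are hypothetical and not part of the original text. It builds the $M \times M^2$ Gabor frame and computes the quantities that a tight frame with unit-norm columns should reproduce.

```python
import numpy as np

def gabor_frame(g):
    """Return the M x M^2 matrix [W_0 T, W_1 T, ..., W_{M-1} T] for a seed g."""
    g = np.asarray(g, dtype=complex)
    M = g.size
    # Column l of T is the seed circularly shifted downward by l positions.
    T = np.column_stack([np.roll(g, l) for l in range(M)])
    q = np.arange(M)
    blocks = []
    for m in range(M):
        omega_m = np.exp(2j * np.pi * m * q / M)  # samples of the m-th sinusoid
        blocks.append(omega_m[:, None] * T)       # W_m T, with W_m = diag(omega_m)
    return np.hstack(blocks)

rng = np.random.default_rng(0)                    # illustrative random seed vector
M = 7
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
g /= np.linalg.norm(g)                            # unit-norm seed

Phi = gabor_frame(g)
gram_rows = Phi @ Phi.conj().T                    # M * I for a tight frame
spectral_norm = np.linalg.norm(Phi, 2)            # largest singular value
```

For a unit-norm seed, `gram_rows` should be close to $M$ times the identity and `spectral_norm` close to $\sqrt{M}$, which is the tight-frame behaviour stated in the theorem that follows.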

Theorem 11.12 (Spectral Norm of Gabor Frames [178]). Gabor frames generated
from nonzero (unit-norm) seed vectors are tight frames; in other words, we have that

\[
\|\Phi\|_2 = \sqrt{M}.
\]

Theorem  11.12  implies  that  Gabor  frames  are  the  best  that  one  can  hope  for  in  terms 
of  the  spectral  norm.  The  next  result  that  we  prove  concerns  the  average  coherence 
of  Gabor  frames. 

Theorem 11.13 (Average Coherence of Gabor Frames). Let $\Phi$ be a Gabor frame
generated from a unit-norm seed vector $g \in \mathbb{C}^M$. Then, using the notation $g_{\max} =
\max_i |g_i|$ and $g_{\min} = \min_i |g_i|$, the average coherence of $\Phi$ can be bounded from
above as follows:

\[
\nu(\Phi) \le \frac{M g_{\max}(\sqrt{M} - g_{\min}) + 1 - M g_{\min}^2}{M^2 - 1}. \tag{11.4.3}
\]


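Proof. In order to facilitate the proof of this theorem, we first map the indices of the
columns of $\Phi$ from $\{1, \dots, M^2\}$ to $\mathcal{C} = \{0, \dots, M-1\} \times \{0, \dots, M-1\}$ as follows

\[
\kappa : i \mapsto \big( (i-1) \bmod M, \ \lfloor (i-1)/M \rfloor \big). \tag{11.4.4}
\]

In words, $\kappa(i) = (\ell, m)$ signifies that the $i$-th column of $\Phi$ corresponds to the $(\ell+1)$-th
column of $W_m T$; we write $\varphi_{(\ell,m)}$ for this column. Next, fix an index $i$ (resp. $\kappa(i) = (\ell, m)$) and make use of the
above reindexing to write

\[
\begin{aligned}
\sum_{\substack{j=1 \\ j \ne i}}^{M^2} \langle \varphi_i, \varphi_j \rangle
&= \sum_{\substack{(\ell',m') \in \mathcal{C} \\ (\ell',m') \ne (\ell,m)}} \langle \varphi_{(\ell,m)}, \varphi_{(\ell',m')} \rangle \\
&= \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} \sum_{m'=0}^{M-1} \langle \varphi_{(\ell,m)}, \varphi_{(\ell',m')} \rangle
 + \sum_{\substack{m'=0 \\ m' \ne m}}^{M-1} \langle \varphi_{(\ell,m)}, \varphi_{(\ell,m')} \rangle.
\end{aligned} \tag{11.4.5}
\]

Finally, note that we can explicitly write the columns of $\Phi$ using (11.4.2) for any
$(\ell, m) \in \mathcal{C}$ as follows

\[
\varphi_{(\ell,m)} = \big[ g_{(1-\ell)_M} e^{j2\pi\frac{m}{M}0} \ \ \cdots \ \ g_{(M-\ell)_M} e^{j2\pi\frac{m}{M}(M-1)} \big]^T \tag{11.4.6}
\]

where we use the notation $g_{(q)_M}$ as a shorthand for $g_{q \bmod M}$. The rest of the proof
now follows from simple algebraic manipulations. Specifically, it follows from (11.4.6)
that the first term in (11.4.5) can be simplified as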


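\[
\begin{aligned}
\sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} \sum_{m'=0}^{M-1} \langle \varphi_{(\ell,m)}, \varphi_{(\ell',m')} \rangle
&= \sum_{q=1}^{M} \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} g^*_{(q-\ell)_M} g_{(q-\ell')_M} \sum_{m'=0}^{M-1} e^{j2\pi\frac{q-1}{M}(m'-m)} \\
&= M \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} g^*_{(1-\ell)_M} g_{(1-\ell')_M}
 + \sum_{q=2}^{M} \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} g^*_{(q-\ell)_M} g_{(q-\ell')_M} \sum_{m'=0}^{M-1} e^{j2\pi\frac{q-1}{M}(m'-m)} \\
&\overset{(a)}{=} M \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} g^*_{(1-\ell)_M} g_{(1-\ell')_M}
 = M g^*_{(1-\ell)_M} \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} g_{(1-\ell')_M}
\end{aligned} \tag{11.4.7}
\]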


where (a) in the above expression is a consequence of the fact that $\sum_{m'=0}^{M-1} e^{j2\pi\frac{q-1}{M}m'} = 0$
for any fixed $q \in \{2, \dots, M\}$. Likewise, we can simplify the second term in (11.4.5)
as follows


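\[
\begin{aligned}
\sum_{\substack{m'=0 \\ m' \ne m}}^{M-1} \langle \varphi_{(\ell,m)}, \varphi_{(\ell,m')} \rangle
&= \sum_{q=1}^{M} g^*_{(q-\ell)_M} g_{(q-\ell)_M} \sum_{\substack{m'=0 \\ m' \ne m}}^{M-1} e^{j2\pi\frac{q-1}{M}(m'-m)} \\
&= \sum_{q=2}^{M} |g_{(q-\ell)_M}|^2 \sum_{\substack{m'=0 \\ m' \ne m}}^{M-1} e^{j2\pi\frac{q-1}{M}(m'-m)}
 + |g_{(1-\ell)_M}|^2 \sum_{\substack{m'=0 \\ m' \ne m}}^{M-1} 1 \\
&\overset{(b)}{=} -\sum_{q=2}^{M} |g_{(q-\ell)_M}|^2 + (M-1) |g_{(1-\ell)_M}|^2
 = -1 + M |g_{(1-\ell)_M}|^2
\end{aligned} \tag{11.4.8}
\]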


where (b) follows from the fact that $\sum_{m' \ne m} e^{j2\pi\frac{q-1}{M}(m'-m)} = -1$ for any fixed
$q \in \{2, \dots, M\}$, together with the unit norm of $g$.

To  conclude  the  theorem,  note  from  (11.4.5),  (11.4.7),  and  (11.4.8)  that  we  can  write 

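\[
\begin{aligned}
\max_{i \in \{1,\dots,M^2\}} \Big| \sum_{\substack{j=1 \\ j \ne i}}^{M^2} \langle \varphi_i, \varphi_j \rangle \Big|
&= \max_{\ell} \Big| M g^*_{(1-\ell)_M} \sum_{\substack{\ell'=0 \\ \ell' \ne \ell}}^{M-1} g_{(1-\ell')_M} - 1 + M |g_{(1-\ell)_M}|^2 \Big| \\
&\overset{(c)}{\le} \max_{r \in \{1,\dots,M\}} \Big| M g^*_r \sum_{\substack{s=1 \\ s \ne r}}^{M} g_s \Big|
 + \max_{r \in \{1,\dots,M\}} \big| -1 + M |g_r|^2 \big| \\
&\le M \max_{r \in \{1,\dots,M\}} |g_r| \sum_{\substack{s=1 \\ s \ne r}}^{M} |g_s| + 1 - M g_{\min}^2 \\
&\overset{(d)}{\le} M g_{\max} (\sqrt{M} - g_{\min}) + 1 - M g_{\min}^2.
\end{aligned} \tag{11.4.9}
\]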

Here, (c) mainly follows from the triangle inequality and a simple reindexing argument,
while (d) mainly follows from the Cauchy-Schwarz inequality since
$\sum_{s \ne r} |g_s| = \sum_{s=1}^{M} |g_s| - |g_r| \le \sqrt{M} - g_{\min}$. The proof of the theorem now follows by dividing the above
expression by $M^2 - 1$. □


In  words,  Theorem  11.13  states  that  the  average  coherence  of  Gabor  frames  cannot 
be  too  large.  In  particular,  it  implies  that  Gabor  frames  generated  from  unimodal 
(unit-norm) seed vectors (i.e., seed vectors characterized by $g_{\min} \asymp g_{\max} \asymp M^{-1/2}$)
satisfy $\nu(\Phi) \lesssim M^{-1}$. On the other hand, recall that the Welch bound [251] dictates
that $\mu(\Phi) \ge (M+1)^{-1/2}$ for Gabor frames. It is therefore possible to conclude
from these two facts that Gabor frames generated from unimodal seed vectors are
automatically guaranteed to satisfy the coherence property (resp. strong coherence
property) as long as $\mu(\Phi) \lesssim (\log N)^{-1/2}$ (resp. $\mu(\Phi) \lesssim (\log N)^{-1}$). In the context
of  model  selection  and  sparse-signal  recovery,  Theorem  11.13  therefore  suggests  that 
Gabor  frames  generated  from  unimodal  seed  vectors  are  the  best  that  one  can  hope 
for  in  terms  of  the  average  coherence. 
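
As a quick check of how the bound (11.4.3) behaves, consider the extreme case of a constant-modulus seed, $g_{\min} = g_{\max} = M^{-1/2}$ (an illustrative special case of the unimodal seeds discussed above). Substituting into the bound gives

\[
\nu(\Phi) \le \frac{M \cdot M^{-1/2} (\sqrt{M} - M^{-1/2}) + 1 - M \cdot M^{-1}}{M^2 - 1}
= \frac{(M - 1) + 0}{M^2 - 1} = \frac{1}{M+1},
\]

which is of order $M^{-1}$.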

Finally,  recall  from  the  discussions  in  Section  11.2  that — among  the  class  of  matrices 
that  satisfy  the  coherence  property — design  matrices  with  small  worst-case  coherence 
are  particularly  well-suited  for  model  selection  and  sparse-signal  recovery.  In  the 
context  of  Gabor  frames,  the  goal  then  is  to  design  unimodal  seed  vectors  that  yield 
Gabor  frames  with  the  smallest-possible  worst-case  coherence.  This,  however,  is  an 
active  area  of  mathematical  research  and  a  number  of  researchers  have  looked  at  this 
problem  in  recent  years;  see,  e.g.,  [230].  As  such,  we  can  simply  leverage  some  of 
the  existing  research  in  this  area  in  order  to  provide  explicit  constructions  of  Gabor 
frames  that  satisfy  the  coherence  property  with  nearly-optimal  worst-case  coherence. 

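Specifically, let $M \ge 5$ be a prime number and construct a unimodal seed vector
$g \in \mathbb{C}^M$ as follows

\[
g = \left[ \frac{1}{\sqrt{M}} e^{j2\pi\frac{0^3}{M}} \quad \frac{1}{\sqrt{M}} e^{j2\pi\frac{1^3}{M}} \quad \cdots \quad \frac{1}{\sqrt{M}} e^{j2\pi\frac{(M-1)^3}{M}} \right]^T. \tag{11.4.10}
\]

The sequence $\left\{ \frac{1}{\sqrt{M}} e^{j2\pi\frac{q^3}{M}} \right\}_{q=0}^{M-1}$ is termed the Alltop sequence [6] in the literature. This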


sequence  has  the  property  that  its  autocorrelation  decays  very  fast  and,  therefore,  it  is 
particularly  well-suited  for  generating  Gabor  frames  with  small  worst-case  coherence. 
In  particular,  it  was  established  recently  in  [230]  that  Gabor  frames  generated  from 
the  Alltop  seed  vector  g  given  in  (11.4.10)  satisfy 


\[
\mu(\Phi) = \max_{i \ne j} |\langle \varphi_i, \varphi_j \rangle| \le \frac{1}{\sqrt{M}}. \tag{11.4.11}
\]


In addition, since we have that $g_{\min} = g_{\max} = M^{-1/2}$ for the Alltop seed vector, it is
possible to conclude from Theorem 11.13 that the average coherence of Alltop Gabor
frames satisfies $\nu(\Phi) \le (M+1)^{-1} \le \mu(\Phi)/\sqrt{M}$. An immediate consequence of this
discussion  is  that  all  the  results  reported  in  Section  11.2  in  the  context  of  model 
selection  using  OST  apply  directly  to  the  case  of  Alltop  Gabor  frames. 
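
The two coherence claims for Alltop Gabor frames can be checked numerically. The sketch below assumes NumPy and uses the small prime $M = 7$ purely for illustration; the helper names `gabor_frame` and `coherences` are hypothetical. It takes the average coherence to be $\nu(\Phi) = \frac{1}{N-1} \max_i |\sum_{j \ne i} \langle \varphi_i, \varphi_j \rangle|$, the quantity bounded in the proof of Theorem 11.13.

```python
import numpy as np

def gabor_frame(g):
    """M x M^2 Gabor frame: modulations of downward circular shifts of g."""
    g = np.asarray(g, dtype=complex)
    M = g.size
    T = np.column_stack([np.roll(g, l) for l in range(M)])
    q = np.arange(M)
    return np.hstack([np.exp(2j * np.pi * m * q / M)[:, None] * T
                      for m in range(M)])

def coherences(Phi):
    """Return (worst-case coherence, average coherence) of a unit-norm-column matrix."""
    G = Phi.conj().T @ Phi                 # Gram matrix of the columns
    N = G.shape[0]
    off = G - np.diag(np.diag(G))          # remove the diagonal (j = i) terms
    mu = np.abs(off).max()
    nu = np.abs(off.sum(axis=1)).max() / (N - 1)
    return mu, nu

M = 7                                      # any prime M >= 5 (illustrative choice)
q = np.arange(M)
g = np.exp(2j * np.pi * q**3 / M) / np.sqrt(M)   # Alltop seed, as in (11.4.10)

Phi = gabor_frame(g)
mu, nu = coherences(Phi)
```

Under these assumptions `mu` should match $1/\sqrt{M}$ up to floating-point error and `nu` should not exceed $1/(M+1)$.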
Geometry of Discrete-Chirp Matrices

An $M$-length chirp signal for any prime $M$ takes the form

\[
\varphi_{(m,r)}(\ell) = \frac{1}{\sqrt{M}} e^{j\frac{2\pi}{M} m \ell + j\frac{2\pi}{M} r \ell^2}, \quad \ell = 0, \dots, M-1 \tag{11.4.12}
\]

where $m$ is the base frequency and $r$ is the chirp rate of the signal. Discrete-chirp
matrices are $M \times M^2$ matrices that are constructed by collecting all possible chirp
signals into columns [11]. The columns of the $M \times M^2$ discrete-chirp matrix $\Phi$ are the
$M^2$ distinct chirp signals corresponding to the $M^2$ possible pairs $(m, r) \in \mathbb{Z}_M \times \mathbb{Z}_M$.
The  following  lemma  characterizes  the  geometry  of  discrete-chirp  matrices. 


Lemma 11.14. Let $\Phi$ be an $M \times M^2$ discrete-chirp matrix for any prime $M$. Then
$\Phi$ satisfies (i) $\mu(\Phi) = \frac{1}{\sqrt{M}}$, (ii) $\nu(\Phi) = \frac{N-M}{M(N-1)}$, and (iii) $\|\Phi\|_2 = \sqrt{\frac{N}{M}} = \sqrt{M}$.


Here, the worst-case coherence bound and the spectral norm expression follow from
[64], while the average coherence expression follows from the fact that $\Phi\Phi^H =
\frac{N}{M} I$. Finally, note that the structure of the discrete-chirp matrix implies that
multiplications with $\Phi$ and its adjoint can be carried out using the FFT in $O(N \log N)$ time.
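
The three quantities in Lemma 11.14 can be reproduced numerically for a small prime. The sketch below assumes NumPy, the chirp definition $\frac{1}{\sqrt{M}} e^{j2\pi(m\ell + r\ell^2)/M}$, and the illustrative choice $M = 7$; the name `chirp_matrix` is hypothetical, and the dense matrix products stand in for the FFT-based multiplications mentioned above.

```python
import numpy as np

def chirp_matrix(M):
    """M x M^2 discrete-chirp matrix; one column per (base frequency m, chirp rate r)."""
    l = np.arange(M)
    cols = [np.exp(2j * np.pi * (m * l + r * l**2) / M) / np.sqrt(M)
            for r in range(M) for m in range(M)]
    return np.column_stack(cols)

M = 7                                       # prime, illustrative choice
Phi = chirp_matrix(M)
N = Phi.shape[1]

G = Phi.conj().T @ Phi                      # Gram matrix of the columns
off = G - np.diag(np.diag(G))               # off-diagonal inner products
mu = np.abs(off).max()                      # worst-case coherence
nu = np.abs(off.sum(axis=1)).max() / (N - 1)  # average coherence
spectral_norm = np.linalg.norm(Phi, 2)
```

With $N = M^2$ the average coherence expression $\frac{N-M}{M(N-1)}$ reduces to $\frac{1}{M+1}$, which is what `nu` should return here.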


Geometry  of  Dual  BCH  Sensing  Matrices 

Dual BCH sensing matrices constitute another class of design matrices that
corresponds to exponentiating the codewords of an algebraic code. Specifically, take
$m \in \mathbb{Z}_+$ to be an odd number and use BCH$(m, 2)$ to denote the extended 2-error
correcting, binary BCH code of length $M = 2^m$ [183]. Then the dual of BCH$(m, 2)$
is a code of length $M$ and dimension $2m + 1$ that is the union of $M$ cosets of the
first-order Reed-Muller code RM$(1, m)$ of dimension $m + 1$. The important thing
to point out here is that exponentiating codewords in the dual of BCH$(m, 2)$ and
scaling the resulting $M \times M^2$ matrix $\Phi$ by $1/\sqrt{M}$ gives a union of $M$ orthonormal
bases. This can be seen by noting that exponentiating codewords in RM$(1, m)$ gives
Walsh basis vectors (and their negatives, which we discard here). We also note that,
for the very same reason, multiplications with $\Phi$ and its adjoint in the case
of dual BCH sensing matrices can also be carried out using the FHT in $O(N \log N)$
time. The following lemma characterizes the geometry of dual BCH$(m, 2)$ sensing
matrices.

Lemma 11.15. Let $\Phi$ be an $M \times M^2$ dual BCH sensing matrix obtained from the
dual of BCH$(m, 2)$ for some odd $m$. Then the matrix $\Phi$ satisfies (i) $\mu(\Phi) = \sqrt{\frac{2}{M}}$,
(ii) $\nu(\Phi) = \frac{N-M}{M(N-1)}$, and (iii) $\|\Phi\|_2 = \sqrt{\frac{N}{M}}$.
Remark 11.16. In this section, we introduced deterministic design matrices with
optimal spectral norm $\|\Phi\|_2 = \sqrt{N/M}$ and worst-case coherence $\mu(\Phi) \asymp M^{-0.5}$. The
introduced matrices also satisfy the coherence property, as $\nu(\Phi) \lesssim M^{-1} \lesssim \mu(\Phi)/\sqrt{M}$.
However, these matrices suffer from the restriction that each matrix can have at most
$M^2$ columns (i.e., $N \le M^2$). In the next chapter, we will introduce the Delsarte-Goethals
frames ($DG(m, r)$) as another family of design matrices with optimal spectral
norm ($\|\Phi\|_2 = \sqrt{N/M}$) and worst-case coherence ($\mu \le \frac{2^r}{\sqrt{M}}$). We will see that the
Delsarte-Goethals frames also have much smaller average coherence ($\nu = \frac{1}{N-1} \le \frac{1}{M}$)
and a much larger number of columns, $N = M^{r+2}$.


11.5  Conclusion 

In  this  chapter,  we  have  revisited  two  variants  of  the  often  forgotten  but  extremely  fast 
one-step  thresholding  (OST)  algorithm  for  model  selection.  One  of  the  key  insights 
offered  by  the  chapter  in  this  regard  is  that  polynomial-time  model  selection  can  be 
carried  out  even  when  signal  reconstruction  (and  thereby  the  lasso)  fails.  In  addition, 
we have established in the chapter that if the $M \times N$ design matrix $\Phi$ satisfies
$\mu(\Phi) \asymp M^{-1/2}$ and $\nu(\Phi) \lesssim M^{-1}$ then OST can perform near-optimally for the case
when either (i) the minimum-to-average ratio (MAR) of the signal is not too small
or (ii) the signal-to-noise ratio (SNR) in the measurement system is not too high. It
is worth pointing out here that some researchers in the past have observed that the
sorted variant of the OST (SOST) algorithm at times performs similar to or better
than the lasso (see Fig. 11.2 for an illustration of this in the case of an Alltop Gabor
frame in $\mathbb{C}^{127}$). One of our main contributions in this regard is that we have taken
the mystery out of this observation and explicitly specified in the chapter the four
key parameters of the model-selection problem, namely, $\mu(\Phi)$, $\nu(\Phi)$, MAR, and SNR,
that  determine  the  non-asymptotic  performance  of  the  SOST  algorithm  for  generic 
(random  or  deterministic)  design  matrices  and  data  vectors  having  generic  (random 
or  deterministic)  nonzero  entries;  also,  see  [120]  for  a  comparison  of  our  results  with 
corresponding  results  recently  reported  in  the  literature. 

The  second  main  contribution  of  this  chapter — which  completely  sets  it  apart  from 
existing  work  on  thresholding  for  model  selection — is  that  we  have  proposed  and  ana¬ 
lyzed  a  model-order  agnostic  threshold  for  the  OST  algorithm.  The  significance  of  this 
aspect  of  the  chapter  can  be  best  understood  by  realizing  that  in  real-world  applica¬ 
tions  it  is  often  easier  to  estimate  the  SNR  and  the  noise  variance  in  the  system  than  to 
estimate  the  true  model  order.  In  particular,  we  have  established  in  the  chapter  that 
the threshold $\lambda = \max\left\{ \frac{1}{t} 10 \mu \sqrt{M \cdot \mathrm{SNR}}, \ \frac{1}{1-t} \sqrt{2} \right\} \sqrt{2 \sigma^2 \log N}$ for $t \in (0, 1)$ enables the
OST  algorithm  to  carry  out  near-optimal  partial  model  selection.  Fig.  11.3  reports  the 
results  of  an  experiment  concerning  partial  model-selection  performance  of  the  OST 
(a) Plots of the fraction of detections, defined as $f_D = |\hat{S} \cap S|/k$, and the fraction of
false alarms, defined as $f_{FA} = (|\hat{S}| - |\hat{S} \cap S|)/|\hat{S}|$, versus the model order (averaged
over 200 independent trials) for both SOST and the lasso.

(b) Plots of the amount of time (averaged over 200 independent trials) that it takes
SOST and the lasso to solve one model-selection problem versus the model order.

Figure 11.2: Numerical comparisons between the performance of the SOST algorithm
(Algorithm 9) and the lasso [233] using an Alltop Gabor frame. The $M \times N$ design
matrix $\Phi$ has dimensions $M = 127$ and $N = M^2$, the MAR of the signals is 1, the SNR
in the measurement system is 10 dB, and the noise variance is $\sigma^2 = 10^{-2}$.

Figure 11.3: Partial model-selection performance of the OST algorithm (averaged
over 200 independent trials) corresponding to an Alltop Gabor frame in $\mathbb{C}^{997}$. The
MAR of the signals in this experiment is 1, the SNR in the measurement system is
3 dB, and the noise variance is $\sigma^2 = 10^{-2}$.


algorithm in terms of the metrics of fraction of detections, $f_D = |\hat{S} \cap S|/k$, and fraction of
false alarms, $f_{FA} = (|\hat{S}| - |\hat{S} \cap S|)/|\hat{S}|$, averaged over 200 independent trials. In this experiment,
the $M \times N$ design matrix $\Phi$ corresponds to an Alltop Gabor frame in $\mathbb{C}^{997}$, the noise
variance is $\sigma^2 = 10^{-2}$, the MAR and the SNR are chosen to be 1 and 3 dB, respectively,
and  the  initial  threshold  is  set  at  As  =  max  j  jd  M  ■  SNR,  y/2  j  a/ 2 a2  log  N  with 

t  =  (a/2  —  1) / a/2  and  d  =  2 1.  It  can  clearly  be  seen  from  Fig.  11.3  that  OST  success¬ 
fully  carries  out  partial  model  selection  (/fa  =  0)  even  when  the  threshold  is  set  at 
0.6AS. 
Chapter  12

Reed-Muller  Based  Compressed 
Sensing 


In the previous chapter, we introduced two fundamental measures of coherence between
the columns of a sensing matrix, and showed that if the matrix satisfies a
coherence property, then a simple One-Step Thresholding algorithm can successfully
recover most $k$-sparse vectors. In this chapter, we introduce the spectral norm of
the sensing matrix as a measure of coherence between the rows of the matrix, and
show that if a sensing matrix also has sufficiently small spectral norm, then most $k$-sparse
vectors have unique representations in the measurement domain. This further implies
that reconstruction algorithms such as LASSO or OST not only can successfully
recover the supports of most $k$-sparse vectors, but are also capable of providing close
approximations to those vectors.

The coherence between rows of a sensing matrix is a measure of the new information
provided by an additional measurement. The spectral norm $\|\Phi\|_2$ measures the
maximal coherence between the rows of the frame. The ideal case is when different
measurements are orthogonal. Then, provided that the matrix also has sufficiently low
worst-case coherence, with high probability a $k$-sparse vector has a unique sparse
representation [240], and this representation can be efficiently recovered using a LASSO
program [50].

In this chapter, we consider sensing matrices based on the $\mathbb{Z}_4$-linear representation
of Delsarte-Goethals codes. The columns are obtained by exponentiating codewords
in the quaternary Delsarte-Goethals code; they are uniformly and very precisely
distributed over the surface of an $M$-dimensional sphere. Coherence between columns
reduces to properties of these algebraic codes. Section 12.2.1 reviews the construction
of Delsarte-Goethals (DG) sets of $\mathbb{Z}_4$-linear quadratic forms, which is the starting
point for the construction of the corresponding codes; each quadratic form determines
a codeword whose entries are the values taken by the quadratic form. Section 12.2.1
introduces Delsarte-Goethals frames; the columns of these sensing matrices are
obtained by exponentiating DG codewords. We then determine the worst-case
coherence, average coherence, and spectral norm for these sensing matrices.

Candes and Plan [50] specified coherence conditions under which a LASSO program
will successfully recover a $k$-sparse signal when the $k$ nonzero entries are above the
noise variance. Similarly, in Theorems 11.2 to 11.5 we proved analogous coherence
conditions for successful recovery by the OST algorithm. We use these results to provide
an average-case error analysis for stochastic noise in both the data and measurement
domains. The Delsarte-Goethals (DG) sensing matrices are essentially tight frames, so
that white noise in the data domain maps to white noise in the measurement domain.

Section  12.3  presents  the  results  of  numerical  experiments  that  compare  DG  frames 
with  random  Gaussian  matrices  of  the  same  size.  The  SPGL  package  [244,  243]  is  used 
to  implement  the  LASSO  recovery  algorithm  in  all  cases.  It  turns  out  that  the  DG 
frames  have  almost  identical  performance  to  random  matrices  in  terms  of  probability 
of  successful  sparse  recovery,  but  in  contrast  to  random  matrices,  DG  frames  do  not 
suffer from storage and computational limitations. These matrices have deterministic
constructions, and matrix-vector multiplications with them and with their adjoints can be done efficiently
using the Fast Hadamard Transform.

We  remark  that  there  are  alternative  fast  reconstruction  algorithms  that  exploit  the 
structure  of  DG  sensing  matrices.  The  Chirp  Reconstruction  algorithm  proposed  in 
[45, 150, 46] requires only $O(kM \log^2 M)$ operations, independent of the data-domain
dimension $N$, and is known to work extremely well in the presence of noise as long
as the sparsity level is not too high (see [46] for further discussion of the Chirp
Reconstruction algorithm).


12.1 Sparse Reconstruction using Incoherent Tight-Frames


In Definition 2.5 of Chapter 2 we showed that an $M \times N$ dictionary $\Phi$ that satisfies the
condition $\Phi\Phi^H = \frac{N}{M} I_{M \times M}$ is a tight-frame with redundancy $\frac{N}{M}$. Also in Section 3.4.2
we saw that the mutual coherence of any $M \times N$ dictionary is at least $\sqrt{\frac{N-M}{M(N-1)}}$ [251].
Designing dictionaries with small spectral norms (tight frames in the ideal case), and
with small coherence ($\mu = O(1/\sqrt{M})$ in the ideal case) is useful in compressed sensing
for the following reasons.


Uniqueness of Sparse Representation ($\ell_0$ minimization) The following results
are due to Tropp [240], and Gurevich and Hadani [136], and show that with
overwhelming probability the $\ell_0$ minimization program successfully recovers the original
$k$-sparse signal.

Theorem 12.1 ([239, 136]). Assume the dictionary $\Phi$ satisfies $\mu \le \frac{c}{\log N}$, where $c$ is
an absolute constant. Further assume $k \le \frac{cN}{\|\Phi\|^2 \log N}$. Let $S$ be a random subset of $[N]$
of size $k$, and let $\Phi_S$ be the corresponding $M \times k$ submatrix. Then there exists an
absolute constant $c_0$ such that

\[
\Pr\left[ \left\| \Phi_S^H \Phi_S - I \right\| \ge c_0 \left( \mu \log N + 2\sqrt{\frac{k \|\Phi\|^2}{N}} \right) \right] \le 2 N^{-1}.
\]


Theorem 12.2 ([240, 238]). Assume the dictionary $\Phi$ satisfies $\mu \le \frac{c}{\log N}$, where $c$ is
an absolute constant. Further assume $k \le \frac{cN}{\|\Phi\|^2 \log N}$. Let $\alpha^*$ be a $k$-sparse vector, such
that the support of the $k$ nonzero coefficients of $\alpha^*$ is selected uniformly at random.
Further assume that conditioned on the support, the values of the $k$ nonzero entries
of $\alpha^*$ are sampled from a distribution which is absolutely continuous with respect to
the Lebesgue measure on $\mathbb{R}^k$. Then with probability $1 - O(N^{-1})$, $\alpha^*$ is the unique
$k$-sparse vector mapped to $f = \Phi\alpha^*$ by the measurement matrix $\Phi$.


Sparse Recovery via LASSO ($\ell_1$ minimization) Uniqueness of sparse representation
is of limited utility given that $\ell_0$ minimization is computationally intractable.
However, given modest restrictions on the class of sparse signals, Candes and Plan [50]
have shown that with overwhelming probability the solution to the $\ell_0$ minimization
problem coincides with the solution to a convex LASSO program.

Theorem 12.3. Assume the dictionary $\Phi$ satisfies $\mu \le \frac{c}{\log N}$, where $c$ is an absolute
constant. Further assume $k \le \frac{c_1 N}{\|\Phi\|^2 \log N}$, where $c_1$ is a constant. Let $\alpha^*$ be a $k$-sparse
vector, such that

1. The support of the $k$ nonzero coefficients of $\alpha^*$ is selected uniformly at random.

2. Conditional on the support, the signs of the nonzero entries of $\alpha^*$ are independent
and equally likely to be $-1$ or $1$.

Let $f = \Phi\alpha^* + e_M$, where $e_M$ contains $M$ iid $\mathcal{N}(0, \sigma^2)$ Gaussian elements. Then if
$\|\alpha^*\|_{\min} \ge 8\sigma\sqrt{2 \log N}$, with probability $1 - O(N^{-1})$ the LASSO estimate

\[
\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^N} \frac{1}{2} \|f - \Phi\alpha\|^2 + 2\sqrt{2 \log N}\, \sigma \|\alpha\|_1
\]

has the same support and sign as $\alpha^*$, and $\|\Phi\alpha^* - \Phi\hat{\alpha}\|^2 \le c_2 k \sigma^2$, where $c_2$ is a
constant independent of $\alpha^*$.


Stochastic  noise  in  the  data  domain.  The  tight-frame  property  of  the  sensing 
matrix  makes  it  possible  to  map  iid  Gaussian  noise  in  the  data  domain  to  iid  Gaussian 
noise  in  the  measurement  domain: 

Lemma 12.4. Let $e_D$ be a vector with $N$ iid $\mathcal{N}(0, \sigma_D^2)$ entries and $e_M$ be a vector
with $M$ iid $\mathcal{N}(0, \sigma_M^2)$ entries. Let $h = \Phi e_D$ and $e = h + e_M$. Then $e$ contains $M$
entries, sampled iid from $\mathcal{N}(0, \sigma^2)$, where $\sigma^2 = \frac{N}{M}\sigma_D^2 + \sigma_M^2$.

Proof. The tight frame property implies

\[
E[hh^H] = E[\Phi e_D e_D^T \Phi^H] = \sigma_D^2 \Phi\Phi^H = \frac{N}{M} \sigma_D^2 I.
\]

Therefore, $e = h + e_M$ contains iid Gaussian elements with zero mean and variance
$\sigma^2$. □

Next  we  construct  a  family  of  low-coherence  tight  frames  with  optimal  coherence 
parameters  using  Delsarte-Goethals  codes. 


12.2  Construction  of  the  Delsarte-Goethals  Frames 

12.2.1  Delsarte-Goethals  Sets  of  Binary  Symmetric  Matrices 

The finite field $\mathbb{F}_{2^m}$ is obtained from the binary field $\mathbb{F}_2$ by adjoining a root $\xi$ of a
primitive irreducible polynomial $g$ of degree $m$. The elements of $\mathbb{F}_{2^m}$ are polynomials
in $\xi$ of degree at most $m - 1$ with coefficients in $\mathbb{F}_2$, and we will identify the polynomial
$x_0 + x_1\xi + \cdots + x_{m-1}\xi^{m-1}$ with the binary $m$-tuple $(x_0, \dots, x_{m-1})$. The Frobenius
map $f : \mathbb{F}_{2^m} \to \mathbb{F}_{2^m}$ is defined by $f(x) = x^2$ and the Trace map $\mathrm{Tr} : \mathbb{F}_{2^m} \to \mathbb{F}_2$ is
defined by

\[
\mathrm{Tr}(x) = x + x^2 + \cdots + x^{2^{m-1}}.
\]

The identity $(x + y)^2 = x^2 + y^2$ implies that $\mathrm{Tr}(x + y) = \mathrm{Tr}(x) + \mathrm{Tr}(y)$; the trace is a
linear map over the binary field $\mathbb{F}_2$. The trace inner product given by $(v, w) = \mathrm{Tr}(vw)$
is non-degenerate; if $\mathrm{Tr}(vz) = 0$ for all $z$ in $\mathbb{F}_{2^m}$ then $v = 0$. Every element $a$ in
$\mathbb{F}_{2^m}$ determines a symmetric bilinear form $\mathrm{Tr}[xya]$ to which is associated a binary
symmetric matrix $P^0(a)$. That is, $P^0(a)$ is a binary matrix such that for every pair of field
elements $x$ and $y$

\[
\mathrm{Tr}[xya] = (x_0 \cdots x_{m-1}) P^0(a) (y_0 \cdots y_{m-1})^T.
\]

The Kerdock set $K_m$ is the $m$-dimensional binary vector space formed by the matrices
$P^0(a)$. For example, let $m = 3$, and assume the finite field $\mathbb{F}_8$ is generated by adjoining
a root $\xi$ of the polynomial $g(x) = x^3 + x + 1$. Then $K_3$ is spanned by

\[
P^0(100) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
P^0(010) = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}, \quad \text{and} \quad
P^0(001) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}.
\]

Theorem 12.5. Every nonzero matrix in $K_m$ is nonsingular.

Proof. If $xP^0(a) = 0$ for some nonzero $x$, then $\mathrm{Tr}[xya] = 0$ for all $y \in \mathbb{F}_{2^m}$. Now the non-degeneracy of
the trace implies $xa = 0$, and hence $a = 0$. □
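
The $m = 3$ example and Theorem 12.5 can be reproduced with a few lines of finite-field arithmetic. The sketch below is illustrative only: it assumes NumPy, stores a field element $x_0 + x_1\xi + x_2\xi^2$ as an integer whose bit $i$ is $x_i$, and uses hypothetical helper names (`gf_mul`, `gf_trace`, `kerdock_matrix`, `is_nonsingular_mod2`). Entry $(i, j)$ of $P^0(a)$ is computed as $\mathrm{Tr}[\xi^i \xi^j a]$.

```python
import numpy as np

M_DEG = 3        # work in F_{2^3}
POLY = 0b1011    # g(x) = x^3 + x + 1; bit i is the coefficient of x^i

def gf_mul(a, b):
    """Multiply two field elements stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M_DEG) & 1:      # degree reached m: reduce using g(xi) = 0
            a ^= POLY
    return result

def gf_trace(x):
    """Tr(x) = x + x^2 + ... + x^(2^(m-1)); the result is 0 or 1."""
    total, power = 0, x
    for _ in range(M_DEG):
        total ^= power
        power = gf_mul(power, power)
    return total

def kerdock_matrix(a):
    """Binary symmetric matrix P^0(a) with entries Tr[xi^i * xi^j * a]."""
    return np.array([[gf_trace(gf_mul(gf_mul(1 << i, 1 << j), a))
                      for j in range(M_DEG)] for i in range(M_DEG)])

def is_nonsingular_mod2(P):
    """True if no nonzero binary row vector x satisfies x P = 0 (mod 2)."""
    m = P.shape[0]
    for v in range(1, 2 ** m):
        x = np.array([(v >> i) & 1 for i in range(m)])
        if not ((x @ P) % 2).any():
            return False
    return True

# The tuples 100, 010, 001 are (x0, x1, x2), i.e. the elements 1, xi, xi^2.
P100 = kerdock_matrix(0b001)
P010 = kerdock_matrix(0b010)
P001 = kerdock_matrix(0b100)
all_nonsingular = all(is_nonsingular_mod2(kerdock_matrix(a)) for a in range(1, 8))
```

The brute-force null-space search is only practical for small $m$; a Gaussian elimination over $\mathbb{F}_2$ would be the natural replacement for larger fields.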
Next we define higher order bilinear forms, each associated with a binary symmetric
matrix. Given a positive integer $t$ where $0 < t \le \frac{m-1}{2}$ and given a field element $a$,

\[
\mathrm{Tr}\left[ \left( x y^{2^t} + x^{2^t} y \right) a \right]
\]

defines a symmetric bilinear form that is represented by a binary symmetric matrix
$P^t(a)$ as above:

\[
\mathrm{Tr}\left[ \left( x y^{2^t} + x^{2^t} y \right) a \right] = (x_0 \cdots x_{m-1}) P^t(a) (y_0 \cdots y_{m-1})^T.
\]

The Delsarte-Goethals set $DG(m, r)$ is then defined as

\[
DG(m, r) = \left\{ \sum_{t=0}^{r} P^t(a_t) \ \middle| \ a_t \in \mathbb{F}_{2^m}, \ t = 0, 1, \dots, r \right\}. \tag{12.2.1}
\]

The Delsarte-Goethals sets are nested

\[
K_m = DG(m, 0) \subset DG(m, 1) \subset \cdots \subset DG\left(m, \frac{m-1}{2}\right)
\]

and every bilinear form is associated with some matrix in $DG\left(m, \frac{m-1}{2}\right)$.

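For example, let $m = 3$ and $g(x) = x^3 + x + 1$. The set $DG(3, 1)$ is spanned by $K_3$,

\[
P^1(100) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad
P^1(010) = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad \text{and} \quad
P^1(001) = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}.
\]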


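Theorem 12.6. Every nonzero matrix in $DG(m, r)$ has rank at least $m - 2r$.

Proof. If $x$ is in the null space of $\sum_{t=0}^{r} P^t(a_t)$, then for all $y \in \mathbb{F}_{2^m}$

\[
\mathrm{Tr}\left[ x y a_0 + \sum_{t=1}^{r} \left( x y^{2^t} + x^{2^t} y \right) a_t \right] = 0.
\]

Since $\mathrm{Tr}(x) = \mathrm{Tr}(x^2) = \cdots = \mathrm{Tr}(x^{2^r})$, we have

\[
\mathrm{Tr}\left[ \left( (x a_0)^{2^r} + \sum_{t=1}^{r} \left( (x a_t)^{2^{r-t}} + a_t^{2^r} x^{2^{t+r}} \right) \right) y^{2^r} \right] = 0.
\]

Non-degeneracy of the trace now implies

\[
(x a_0)^{2^r} + \sum_{t=1}^{r} \left( (x a_t)^{2^{r-t}} + a_t^{2^r} x^{2^{t+r}} \right) = 0.
\]

The LHS is a polynomial in $x$ of degree at most $2^{2r}$, so there are at most $2^{2r}$ solutions.
Hence the rank of the binary symmetric matrix $\sum_{t=0}^{r} P^t(a_t)$ is at least $m - 2r$. □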
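
For $m = 3$ the largest admissible value is $r = 1$, and $DG(3, 1)$ should then contain $2^{(r+1)m} = 64$ matrices, which is the number of $3 \times 3$ binary symmetric matrices. The sketch below checks this by enumeration. It is a minimal illustration assuming NumPy and the same bit-mask representation of $\mathbb{F}_8$ as before (bit $i$ is the coefficient of $\xi^i$, with $g(x) = x^3 + x + 1$); all helper names are hypothetical.

```python
import numpy as np

M_DEG, POLY = 3, 0b1011   # F_8 generated by g(x) = x^3 + x + 1

def gf_mul(a, b):
    """Multiply two field elements stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> M_DEG) & 1:
            a ^= POLY
    return result

def gf_trace(x):
    """Tr(x) = x + x^2 + ... + x^(2^(m-1)), returned as 0 or 1."""
    total, power = 0, x
    for _ in range(M_DEG):
        total ^= power
        power = gf_mul(power, power)
    return total

def frobenius(x, t):
    """Return x^(2^t) by squaring t times."""
    for _ in range(t):
        x = gf_mul(x, x)
    return x

def dg_basis_matrix(t, a):
    """P^t(a): Tr[x y a] for t = 0, Tr[(x y^(2^t) + x^(2^t) y) a] for t >= 1."""
    P = np.zeros((M_DEG, M_DEG), dtype=int)
    for i in range(M_DEG):
        for j in range(M_DEG):
            x, y = 1 << i, 1 << j
            if t == 0:
                form = gf_mul(x, y)
            else:
                form = gf_mul(x, frobenius(y, t)) ^ gf_mul(frobenius(x, t), y)
            P[i, j] = gf_trace(gf_mul(form, a))
    return P

# Enumerate DG(3, 1) = { P^0(a0) + P^1(a1) mod 2 : a0, a1 in F_8 }.
dg31 = {tuple(map(tuple, (dg_basis_matrix(0, a0) + dg_basis_matrix(1, a1)) % 2))
        for a0 in range(8) for a1 in range(8)}
all_symmetric = all(np.array_equal(np.array(P), np.array(P).T) for P in dg31)
```

The matrices $P^1(a)$ have zero diagonal, so the diagonal of any element of $DG(3, 1)$ is determined by its Kerdock component alone.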
Delsarte-Goethals Frames for Compressed Sensing


In Chapter 11 we introduced two fundamental measures of coherence between the
columns of a tight-frame, and showed how these parameters can be related to the
performance of the LASSO and OST algorithms in model selection and sparse
reconstruction. In this section we construct an explicit sensing matrix (Delsarte-Goethals
frame [45, 47]) with optimal worst-case coherence $\mu$ and average coherence $\nu$. We
start by picking an odd number $m$. The $2^m$ rows of the Delsarte-Goethals frame $\Phi$
are indexed by the binary $m$-tuples $x$, and the $2^{(r+2)m}$ columns are indexed by the
pairs $(P, b)$, where $P$ is an $m \times m$ binary symmetric matrix in the Delsarte-Goethals
set $DG(m, r)$, and $b$ is a binary $m$-tuple. The entry $\varphi_{(P,b),x}$ is given by

\[
\varphi_{(P,b),x} = \frac{1}{\sqrt{M}} \, i^{wt(d_P) + 2wt(b)} \, i^{xPx^T + 2bx^T} \tag{12.2.2}
\]

where $d_P$ denotes the main diagonal of $P$, and $wt$ denotes the Hamming weight
(the number of 1s in the binary vector). Note that all arithmetic in the expressions
$xPx^T + 2bx^T$ and $wt(d_P) + 2wt(b)$ takes place in the ring of integers modulo 4, since
they appear only as exponents of $i$.


The Delsarte-Goethals set $DG(m,r)$, defined in Section 12.2.1, is a binary vector space containing $2^{(r+1)m}$ binary symmetric matrices with the property that the binary sum of any two distinct matrices has rank at least $m - 2r$ (see [140]). The first set $DG(m,0)$ is the classical Kerdock set, and the last set $DG(m,\frac{m-1}{2})$ is the set of all binary symmetric matrices. Given $P$ and $b$, the vector $xPx^T + 2bx^T$ is a codeword in the Delsarte-Goethals code. The set $DG(m,0)$ corresponds to Kerdock codes, and the set $DG(m,\frac{m-1}{2})$ corresponds to all codewords of the second-order Reed-Muller codes. We refer the reader to [41], [43], and [42] for further details.
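As a small numerical illustration of the entry formula (12.2.2), the following sketch builds the frame for the largest set $DG(m,\frac{m-1}{2})$, which needs no finite-field machinery because it consists of all binary symmetric matrices. The function names, the choice $m = 3$, and the use of dense NumPy arrays are assumptions made for illustration only; constructing the Kerdock set $DG(m,0)$ is not attempted here.

```python
import itertools
import numpy as np

POWERS_OF_I = np.array([1, 1j, -1, -1j])  # i^0, i^1, i^2, i^3

def binary_tuples(m):
    """All binary m-tuples, one per row, in lexicographic order."""
    return np.array(list(itertools.product((0, 1), repeat=m)), dtype=np.int64)

def all_symmetric_matrices(m):
    """Every m x m binary symmetric matrix, i.e. the largest set DG(m, (m-1)/2)."""
    upper = [(i, j) for i in range(m) for j in range(i, m)]
    mats = []
    for bits in itertools.product((0, 1), repeat=len(upper)):
        P = np.zeros((m, m), dtype=np.int64)
        for (i, j), v in zip(upper, bits):
            P[i, j] = P[j, i] = v
        mats.append(P)
    return mats

def dg_frame(m, matrices):
    """Dense M x N matrix whose columns follow (12.2.2) for the given P's and all b."""
    X = binary_tuples(m)                          # row indices x (reused as the list of b's)
    M = 2 ** m
    columns = []
    for P in matrices:
        quad = np.einsum("xi,ij,xj->x", X, P, X)  # x P x^T, computed over the integers
        wt_diag = int(np.trace(P))                # wt(d_P)
        for b in X:
            exponent = (wt_diag + 2 * int(b.sum()) + quad + 2 * (X @ b)) % 4
            columns.append(POWERS_OF_I[exponent] / np.sqrt(M))
    return np.stack(columns, axis=1)

m, r = 3, 1                                        # r = (m - 1) / 2
Phi = dg_frame(m, all_symmetric_matrices(m))
M, N = Phi.shape
gram = Phi.conj().T @ Phi
off_diag = gram - np.diag(np.diag(gram))
mu = np.abs(off_diag).max()                        # worst-case coherence
nu = np.abs(off_diag.sum(axis=1)).max() / (N - 1)  # average coherence
print(M, N, mu, nu)
```

For this toy case the matrix has $M = 8$ rows and $N = 512$ columns, so the Gram matrix is small enough to inspect directly; for realistic sizes one would never form it.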

The $r$th Delsarte-Goethals frame is determined by $DG(m,r)$ and has $M = 2^m$ rows and $N = 2^{(r+2)m}$ columns. For a fixed matrix $P$, the $2^m$ columns $\varphi_{(P,b)}$ ($b \in \mathbb{F}_2^m$) form an orthonormal basis $\Gamma_P$ that can also be obtained by postmultiplying the Walsh-Hadamard basis by the unitary transformation $\mathrm{diag}\big[i^{\,xPx^T}\big]$.

Throughout the rest of this section let $\mathbf{1}$ denote the all-one vector. Also let $A$ denote the unnormalized DG frame, i.e., $\Phi = \frac{1}{\sqrt{M}} A$. We use the following lemmas to show that the Delsarte-Goethals frames are low-coherence tight-frames. First we prove that the columns of the $r$th Delsarte-Goethals sensing matrix form a group under pointwise multiplication.


Lemma 12.7. Let $\mathcal{G} = \mathcal{G}(m,r)$ be the set of unnormalized columns $A_{(P,b)}$ with entries

$$a_{(P,b),x} = i^{\,wt(d_P) + 2\,wt(b)}\; i^{\,xPx^T + 2bx^T}, \qquad x \in \mathbb{F}_2^m,$$

where $b \in \mathbb{F}_2^m$ and where the binary symmetric matrix $P$ varies over the Delsarte-Goethals set $DG(m,r)$. Then $\mathcal{G}$ is a group of order $2^{(r+2)m}$ under pointwise multiplication.

Proof. We have

$$a_{(P,b),x}\, a_{(P',b'),x} = i^{\,wt(d_P) + wt(d_{P'}) + 2\,wt(b \oplus b')}\; i^{\,x(P+P')x^T + 2(b \oplus b')x^T},$$

where $\oplus$ is used to emphasize addition in $\mathbb{F}_2^m$. Write $P + P' = (P \oplus P') + 2Q \pmod 4$, where $Q$ is a binary symmetric matrix. Observe that $2xQx^T = 2d_Qx^T \pmod 4$, where the diagonal $d_Q = d_P * d_{P'}$ is the pointwise product of $d_P$ and $d_{P'}$.

Thus $a_{(P,b),x}\, a_{(P',b'),x}$ equals

$$i^{\,[wt(d_P) + wt(d_{P'}) + 2\,wt(d_P * d_{P'})] + 2\,wt(b \oplus b' \oplus d_P * d_{P'})}\; i^{\,x(P \oplus P')x^T + 2(b \oplus b' \oplus d_P * d_{P'})x^T}, \tag{12.2.3}$$

which is equal to $a_{(P \oplus P',\, b \oplus b' \oplus d_P * d_{P'}),x}$. Therefore, $\mathcal{G}$ is closed under pointwise multiplication, and the possible inner products of columns $A_{(P,b)}$, $A_{(P',b')}$ are exactly the possible column sums for columns $A_{(P'',b'')}$, where $P'' = P \oplus P'$ and $b'' = b \oplus b' \oplus d_P * d_{P'}$. □


Next  we  bound  the  worst-case  coherence  of  the  Delsarte-Goethals  frames. 


Theorem 12.8. Let $Q$ be a binary symmetric $m \times m$ matrix from the $DG(m,r)$ set, and let $b \in \mathbb{F}_2^m$. If $S = \sum_{x \in \mathbb{F}_2^m} i^{\,xQx^T + 2bx^T}$, then either $S = 0$, or

$$S^2 = 2^{m+2r}\, i^{\,v_1Qv_1^T + 2bv_1^T}, \qquad \text{where } v_1Q = d_Q.$$

Proof. We have

$$S^2 = \sum_{x,u \in \mathbb{F}_2^m} i^{\,xQx^T + uQu^T + 2b(x+u)^T} = \sum_{x,u \in \mathbb{F}_2^m} i^{\,(x \oplus u)Q(x \oplus u)^T + 2xQu^T + 2b(x \oplus u)^T}.$$

Changing variables to $v = x \oplus u$ and $u$ gives

$$S^2 = \sum_{v \in \mathbb{F}_2^m} i^{\,vQv^T + 2bv^T} \sum_{u \in \mathbb{F}_2^m} (-1)^{(d_Q + vQ)u^T}.$$

Since the diagonal $d_Q$ of a binary symmetric matrix $Q$ is contained in the row space of $Q$, there exists a solution $v_1$ of the equation $vQ = d_Q$. Moreover, since $Q$ has rank at least $m - 2r$, the solutions of the equation $vQ = 0$ form a vector space $E$ of dimension at most $2r$, and for all $e, f \in E$

$$eQe^T + fQf^T = (e + f)Q(e + f)^T \pmod 4.$$

The inner sum over $u$ vanishes unless $vQ = d_Q$, that is, unless $v \in v_1 + E$, in which case it equals $2^m$. Hence

$$S^2 = 2^m \sum_{e \in E} i^{\,(v_1 + e)Q(v_1 + e)^T + 2(v_1 + e)b^T} = 2^m\, i^{\,v_1Qv_1^T + 2v_1b^T} \sum_{e \in E} i^{\,eQe^T + 2eb^T}.$$

The map $e \mapsto eQe^T$ is a linear map from $E$ to $\mathbb{Z}_2$, so the exponent $eQe^T + 2eb^T$ also determines a linear map from $E$ to $\mathbb{Z}_2$ (here we identify $\mathbb{Z}_2$ and $2\mathbb{Z}_4$). If this linear map is the zero map then

$$S^2 = 2^{m+2r}\, i^{\,v_1Qv_1^T + 2bv_1^T},$$

and if it is not zero then $S = 0$. □

Corollary 12.9. Let $\Phi$ be an $M \times N$ $DG(m,r)$ frame whose column entries are defined by (12.2.2). Then $\mu \le \frac{2^r}{\sqrt{M}}$.

Proof. Lemma 12.7 states that the columns of the unnormalized DG frame form a group under pointwise multiplication. Therefore, the inner product between any two columns of this matrix is equal to the sum of the entries of another column of the matrix. Consequently, we have

$$\mu = \max_{i \ne j} |\langle \Phi_i, \Phi_j \rangle| = \frac{1}{2^m} \max_{i \ne j} |\langle A_i, A_j \rangle| = \frac{1}{2^m} \max_{i \ne 1} |\langle A_i, \mathbf{1} \rangle| \le \frac{\sqrt{2^{m+2r}}}{2^m} = \frac{2^r}{\sqrt{2^m}}.$$

□

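Next we show that the Delsarte-Goethals frames have significantly smaller average-coherence. In fact, the Delsarte-Goethals frames achieve the lowest average-coherence among all known low-coherence tight-frames (see Table 11.1).

Lemma 12.10. Let $\Phi$ be a $DG(m,r)$ frame with $M = 2^m$ and $N = M^{r+2}$. Then $\nu = \frac{1}{N-1}$.

Proof. We have

$$\nu = \max_i \frac{1}{N-1} \Big| \sum_{j \ne i} \langle \Phi_i, \Phi_j \rangle \Big| = \frac{1}{M(N-1)} \max_i \Big| \sum_{j \ne i} \langle A_i, A_j \rangle \Big| = \frac{1}{M(N-1)} \Big| \sum_{i \ne 1} \langle A_i, \mathbf{1} \rangle \Big|.$$

Now since the columns of $A$ form a group under pointwise multiplication, and since every row of $A$ has at least one non-identity element, it follows from Lemma 2.17 that every row sum vanishes. Therefore, $\sum_{i \in [N]} \mathbf{1}^T A_i = 0$, and since $A_1 = \mathbf{1}$ we have

$$\frac{1}{M(N-1)} \Big| \sum_{i \ne 1} \langle A_i, \mathbf{1} \rangle \Big| = \frac{1}{M(N-1)} \big| -\mathbf{1}^T \mathbf{1} \big| = \frac{1}{N-1}.$$

□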

Lemma 12.11. Let $\Phi$ be a $DG(m,r)$ frame. Then $\Phi$ is a tight-frame with redundancy $\frac{N}{M}$.

Proof. Let $x$ and $x'$ be two indices in $[M]$. We calculate the inner product between the rows indexed by $x$ and $x'$. It follows from Equation (12.2.2) that the inner product can be written as

$$\sum_{P,b} \varphi_{(P,b),x}\, \overline{\varphi_{(P,b),x'}} = \frac{1}{M} \sum_{P,b} i^{\,xPx^T - x'Px'^T + 2bx^T - 2bx'^T}.$$

Therefore, it follows from Lemma 2.17 that if $x \ne x'$ then the inner product is zero, and is $\frac{N}{M}$ otherwise. □


Putting all the above results together we obtain the following theorem regarding the coherence of the Delsarte-Goethals frames.

Theorem 12.12. Let $m$ be an odd integer, and let $r$ be a positive integer not larger than $\frac{m-1}{2}$. Let $M = 2^m$ and $N = 2^{(r+2)m}$. Let $\Phi$ be the $M \times N$ DG frame whose columns are generated by Equation (12.2.2). Then

1. $\Phi$ is a tight-frame with redundancy $\frac{N}{M}$.

2. $\Phi$ is maximally incoherent. That is, $\nu(\Phi) = \frac{1}{N-1}$ and $\mu(\Phi) \le \frac{2^r}{\sqrt{M}}$.

Proof. The tight-frame property follows from Lemma 12.11. Corollary 12.9 bounds the worst-case coherence of $\Phi$, and Lemma 12.10 calculates its average coherence. □

Remark 12.13. The explicit structure of the DG frames also provides storage and computational advantages over random Gaussian and Rademacher matrices. To see this, observe that each DG frame has the form

$$\Phi = \frac{1}{\sqrt{M}} \big[ D_1HB,\; D_2HB,\; \cdots,\; D_{M^{r+1}}HB \big], \tag{12.2.4}$$

where $B$ is a diagonal matrix with entries $(-1)^{wt(b)}$, $H$ is the unnormalized Hadamard matrix, and each $D_j$ is the diagonal matrix $\mathrm{diag}\big[i^{\,wt(d_{P_j}) + xP_jx^T}\big]$ with $P_j \in DG(m,r)$. As a result, matrix-vector multiplications with $\Phi$ and with its adjoint only require $O(N \log M)$ running time via the Fast Hadamard Transform.
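The block structure in Remark 12.13 can be sketched in a few lines. The code below is an illustrative implementation, not the thesis implementation: the names `fwht` and `dg_apply` are hypothetical, the columns are assumed to be ordered block by block (all $b$ for $P_1$, then all $b$ for $P_2$, and so on), and the structure is exercised on arbitrary binary symmetric matrices rather than on a true $DG(m,r)$ set, since the factorization does not depend on that membership.

```python
import itertools
import numpy as np

POWERS_OF_I = np.array([1, 1j, -1, -1j])  # i^0, i^1, i^2, i^3

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform, O(M log M) for length M = 2^m."""
    v = np.asarray(v, dtype=complex).copy()
    h = 1
    while h < v.size:
        blocks = v.reshape(-1, 2, h)                  # pair up half-blocks of width h
        top, bottom = blocks[:, 0, :], blocks[:, 1, :]
        v = np.stack((top + bottom, top - bottom), axis=1).reshape(-1)
        h *= 2
    return v

def dg_apply(alpha, matrices, m):
    """Compute Phi @ alpha block by block as (1/sqrt(M)) * sum_j D_j H B alpha_j."""
    M = 2 ** m
    X = np.array(list(itertools.product((0, 1), repeat=m)), dtype=np.int64)
    b_signs = 1 - 2 * (X.sum(axis=1) % 2)             # diagonal of B: (-1)^wt(b)
    out = np.zeros(M, dtype=complex)
    for j, P in enumerate(matrices):
        block = alpha[j * M:(j + 1) * M]              # coefficients of the columns (P_j, b)
        quad = np.einsum("xi,ij,xj->x", X, P, X)      # x P_j x^T over the integers
        d_j = POWERS_OF_I[(int(np.trace(P)) + quad) % 4]   # diagonal of D_j
        out += d_j * fwht(b_signs * block)            # one Hadamard transform per block
    return out / np.sqrt(M)
```

Each of the $N/M$ blocks costs one length-$M$ transform, which is where the $O(N \log M)$ total comes from; only the matrices $P_j$ need to be stored, never the frame itself.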


12.2.2 Real-Valued Delsarte-Goethals Frames

In certain cases, we desire a CS matrix with real-valued entries; there are two possible approaches to adapt the DG frames to a real-valued CS matrix.

First, one can restrict the binary symmetric matrices $P$ to the subset of the DG set of matrices with zero-valued diagonal entries. With such a restriction, the term

$$xPx^T = 2 \sum_{i<j} x_i x_j P_{ij}$$

is an even number, rendering the entries of $\Phi$ real-valued. However, since only zero-diagonal binary symmetric matrices are used, a zero-diagonal subset of the $DG(m,r+1)$ set is required in order to obtain $N = M^{r+1}$ columns. As a result, Corollary 12.9 implies that the worst-case coherence of the new real-valued sensing matrix is larger than the worst-case coherence of the complex DG frame by a factor of $\sqrt{2}$.

Alternatively, one can create a CS matrix having twice as many rows as the DG frame by applying the Gray map

$$\left\{ \begin{array}{rcl} 1 & \mapsto & (1,1) \\ i & \mapsto & (1,-1) \\ -1 & \mapsto & (-1,-1) \\ -i & \mapsto & (-1,1) \end{array} \right\}$$

to the entries of the complex DG frame.

The  Gray  map  has  the  property  that  the  norm  of  the  difference  between  any  two 
powers  of  i  is  equal  to  the  norm  of  the  difference  of  their  Gray  map  image  vectors. 
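The following sketch illustrates the distance-preserving property numerically. It assumes that the Gray images are rescaled by $1/\sqrt{2}$, which is the renormalization that keeps columns at unit norm when the number of rows doubles; the function name `gray_expand` and the row ordering (all first coordinates stacked above all second coordinates) are choices made here for convenience.

```python
import numpy as np

def gray_expand(Z):
    """Apply the Gray map entrywise to a complex matrix whose entries are scaled powers of i.

    Each entry z becomes the pair (Re z + Im z, Re z - Im z) / sqrt(2), so that
    1 -> (1, 1), i -> (1, -1), -1 -> (-1, -1), -i -> (-1, 1), up to the 1/sqrt(2) scaling.
    """
    Z = np.asarray(Z, dtype=complex)
    return np.vstack((Z.real + Z.imag, Z.real - Z.imag)) / np.sqrt(2)

rng = np.random.default_rng(0)
powers = np.array([1, 1j, -1, -1j])
Z = powers[rng.integers(0, 4, size=(8, 5))] / np.sqrt(8)   # toy 8 x 5 matrix, unit-norm columns
R = gray_expand(Z)                                          # real 16 x 5 matrix

# Distances between columns before and after the map.
d_complex = np.linalg.norm(Z[:, 0] - Z[:, 1])
d_real = np.linalg.norm(R[:, 0] - R[:, 1])
print(d_complex, d_real)
```

Because the map is real-linear, the real inner product of two Gray-mapped columns equals the real part of the complex inner product of the original columns, which is consistent with the coherence not increasing under the map.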
The new Gray-mapped CS matrix, which we denote by $\Phi^G$, has $M = 2^{m+1}$ rows and $N = 2^{(r+2)m}$ columns. The rows of the matrix are indexed by $x \in \mathbb{F}_2^{m+1}$, and its columns are indexed by pairs $(G, b)$, where $G$ is a binary skew-symmetric matrix, i.e., it has zero diagonal. The entry at row $x$ and column $(G, b)$ of the real-valued DG frame is therefore

$$\varphi^G_{(G,b),x} = \frac{1}{\sqrt{M}}\, i^{\,2\,wt(b)}\; i^{\,xGx^T + 2bx^T}. \tag{12.2.5}$$

Calderbank et al. [140] showed that the correspondence between binary symmetric matrices $P$ in the complex-valued DG frames and binary skew-symmetric matrices $G$ in the real-valued DG frames is given by

$$G = \begin{pmatrix} 0 & d_P \\ d_P^T & d_P^T d_P + P \end{pmatrix}.$$

This mapping between $m \times m$ binary symmetric matrices $P$ and $(m+1) \times (m+1)$ binary skew-symmetric matrices $G$ allows us to prove that real-valued DG frames are incoherent tight-frames.

Theorem 12.14. Let $m$ be an odd integer, and let $r$ be a positive integer not larger than $\frac{m-1}{2}$. Let $M = 2^{m+1}$ and $N = 2^{(r+2)m}$. Let $\Phi$ be the $M \times N$ real-valued DG frame, whose columns are generated by Equation (12.2.5) (or equivalently by applying the Gray map to Equation (12.2.2)). Then $\Phi$ is a tight-frame with redundancy $\frac{N}{M}$, and $\mu(\Phi) \le \frac{2^r}{\sqrt{M}}$.

Proof. The tight-frame property follows from the fact that the real-valued DG frames are unions of orthonormal bases. More precisely, the inner product between any two distinct rows $x$ and $y$ can be written as

$$\frac{1}{M} \sum_{G,b} i^{\,xGx^T + 2bx^T - yGy^T - 2by^T} = \frac{1}{M} \sum_{G} i^{\,xGx^T - yGy^T} \sum_{b} i^{\,2bx^T - 2by^T}. \tag{12.2.6}$$

Now since $b$ ranges over all field elements, the inner sum is equal to zero unless $x = y$. On the other hand, if $x = y$, then

$$\frac{1}{M} \sum_{G,b} i^{\,xGx^T + 2bx^T - yGy^T - 2by^T} = \frac{N}{M}.$$

Moreover, it follows from the distance-preserving property of the Gray map [41, 42], and Corollary 12.9, that the inner product between any two distinct columns of a real-valued DG frame is at most $\frac{2^r}{\sqrt{M}}$. □

Remark 12.15. Note that even though the Gray map doubles the number of required measurements, it also decreases the worst-case coherence and the spectral norm of the matrix by a factor of $\sqrt{2}$. In other words, a complex $DG(m,r)$ frame has $M_C = 2^m$ rows, $N_C = 2^{(r+2)m}$ columns, worst-case coherence $\mu_C = \frac{2^r}{\sqrt{2^m}}$, and spectral norm $\|\Phi_C\| = \sqrt{2^{(r+1)m}}$, whereas a real-valued $DG(m,r)$ frame has $M_R = 2^{m+1}$ rows, $N_R = 2^{(r+2)m}$ columns, worst-case coherence $\mu_R = \frac{2^r}{\sqrt{2^{m+1}}}$, and spectral norm $\|\Phi_R\| = \sqrt{2^{(r+1)m-1}}$.


12.3 Efficient Compressed Sensing via the Delsarte-Goethals Frames

So far we have proved that Delsarte-Goethals frames are tight-frames with optimal coherence values. On the other hand, in Chapter 11 we proved that the coherence property of Definition 11.1 is a sufficient condition for the fidelity of the OST algorithm (Algorithm 8). Later on, we also showed that if a sensing matrix has sufficiently small worst-case coherence and spectral norm, then most $k$-sparse vectors have unique low-dimensional representations (Theorem 12.2), and the LASSO algorithm can successfully recover a close approximation to them (Theorem 12.3). Now we combine these results with the low-coherence results of Theorem 12.12 to provide sparse approximation guarantees for the LASSO and OST algorithms combined with the Delsarte-Goethals frames.

Figure 12.1: Probability of exact signal recovery as a function of the sparsity level $k$ and the data domain dimension $N$ using a DG(9, 0) frame.

Theorem 12.16. Let $m$ be an odd number, and let $r$ be an integer less than $\frac{m-1}{2}$. Let $\Phi$ be an $M \times N$ DG frame. Assume

$$k \le \frac{\kappa_1 M}{2^{2r} \log N},$$

where $\kappa_1$ is an absolute constant. Let $\alpha^*$ be a $k$-sparse vector such that

1. The support of the $k$ nonzero coefficients of $\alpha^*$ is selected uniformly at random.

2. Conditional on the support, the signs of the nonzero entries of $\alpha^*$ are independent and equally likely to be $-1$ or $1$.

3. The distribution of the $k$ nonzero entries of $\alpha^*$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^k$.

Let $f = \Phi\alpha^* + \Phi e_D + e_M$, where $e_D$ is the data-domain noise, containing $N$ iid $\mathcal{N}(0, \sigma_D^2)$ Gaussian elements, and $e_M$ is the measurement-domain noise, containing $M$ iid $\mathcal{N}(0, \sigma_M^2)$ Gaussian elements. Let $\sigma = \sqrt{\frac{N}{M}\sigma_D^2 + \sigma_M^2}$. Then if $\|\alpha^*\|_{\min} \ge 8\sigma\sqrt{2\log N}$, with probability $1 - O(N^{-1})$ the LASSO estimate

$$\hat{\alpha} = \arg\min_{\alpha \in \mathbb{R}^N} \frac{1}{2}\|f - \Phi\alpha\|_2^2 + 2\sqrt{2\log N}\,\sigma\,\|\alpha\|_1$$

has the same support and sign as $\alpha^*$, and $\|\alpha^* - \hat{\alpha}\|_2^2 \le \kappa_2\sigma^2$, where $\kappa_2$ is a constant independent of $\alpha^*$.

Figure 12.2: Average reconstruction error as a function of the data-domain noise ($\sigma_D$) and the measurement-domain noise ($\sigma_M$) using a DG(9, 0) frame.


Proof. First, note that without loss of generality we can assume that the measurement noise is $e = \Phi e_D + e_M$. It follows from Lemma 12.4 that $e$ contains $M$ entries, sampled iid from $\mathcal{N}(0, \sigma^2)$. Now Theorem 12.3 guarantees that with probability $1 - O(N^{-1})$ the LASSO estimate $\hat{\alpha}$ has the same support as $\alpha^*$, and moreover $\|\Phi\alpha^* - \Phi\hat{\alpha}\|_2^2 \le \kappa\sigma^2$. Finally, since both $\hat{\alpha}$ and $\alpha^*$ are supported on the same random $k$-subset of $[N]$, Theorem 12.1 guarantees that, with failure probability not exceeding $\frac{1}{N}$, $\|\alpha^* - \hat{\alpha}\|_2^2 \le 2\|\Phi\alpha^* - \Phi\hat{\alpha}\|_2^2$, which completes the proof. □
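The noise-folding step invoked in the proof can be written out explicitly. The short derivation below assumes only that $e_D$ and $e_M$ are independent and that $\Phi\Phi^{\dagger} = \frac{N}{M} I_M$, which is the tight-frame property of Lemma 12.11:

$$\begin{aligned}
\operatorname{Cov}(\Phi e_D + e_M)
  &= \Phi\,\operatorname{Cov}(e_D)\,\Phi^{\dagger} + \operatorname{Cov}(e_M) \\
  &= \sigma_D^2\,\Phi\Phi^{\dagger} + \sigma_M^2 I_M \\
  &= \Big(\tfrac{N}{M}\,\sigma_D^2 + \sigma_M^2\Big) I_M \;=\; \sigma^2 I_M .
\end{aligned}$$

In words, data-domain noise is amplified by the redundancy $\frac{N}{M}$ of the frame but remains white in the measurement domain, so the combined noise behaves like $M$ iid Gaussian entries of variance $\sigma^2$.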


Theorem 12.17. Let $m$ be an odd number, and let $r$ be an integer less than $\frac{m-1}{2}$. Let $\Phi$ be an $M \times N$ DG frame. Suppose $e_D$ is the data-domain noise vector containing $N$ iid $\mathcal{N}(0, \sigma_D^2)$ Gaussian elements, $e_M$ is the measurement-domain noise vector containing $M$ iid $\mathcal{N}(0, \sigma_M^2)$ Gaussian elements, and let $\sigma = \sqrt{\frac{N}{M}\sigma_D^2 + \sigma_M^2}$. Then with probability at least $1 - O(N^{-1})$, the SOST algorithm (Algorithm 9) successfully recovers the support of $\alpha^*$, as long as $k \le M/(2\log N)$ and

$$\mathrm{MAR} > \max\left\{ \frac{c_2\, k \log N}{M \cdot \mathrm{SNR}},\; \frac{c_3'\, 2^{2r} k \log N}{M} \right\}. \tag{12.3.1}$$

Here, $c_2$ and $c_3'$ are absolute constants, and the probability of failure is with respect to the true model $\mathcal{S}$ and the noise vectors $e_M$ and $e_D$.

Proof. Since $\Phi$ is a tight-frame with redundancy $\frac{N}{M}$, it follows from Lemma 12.4 that $e = \Phi e_D + e_M$ contains $M$ entries, sampled iid from $\mathcal{N}(0, \sigma^2)$. The proof then follows directly from Theorem 11.3 by setting $\mu = \frac{2^r}{\sqrt{M}}$. □

Figure 12.3: The impact of the noise in the measurement domain on the sparse approximation error $\|\alpha^* - \hat{\alpha}\|_2 / \|\alpha^*\|_2$ of the LASSO algorithm with real-valued DG frames (triangle) and random Gaussian matrices (square). Here the noise standard deviation ranges from $10^{-6}$ to $10^{-1}$, with $k = 200$, $M = 1024$, and $N = 3072$.

We now present numerical experiments to evaluate the performance of the LASSO program with DG frames. We used a complex-valued DG frame with $m = 9$ and $r = 0$. We fixed the number of measurements to $M = 512$ and swept across the sparsity level $k$ and the data dimension $N$.¹ For each $(k, N)$-pair, we repeated the following 100 times: (i) generate a random sparse vector with unit norm, (ii) generate compressive measurements (no noise) using the DG frame, and (iii) recover the signal using LASSO. Figure 12.1 reports the probability of exact recovery over the 100 trials.

Figure 12.4: Probability of exact support recovery as a function of the sparsity level $k$. Here we set $M = 2^{12}$, $N = 2^{32}$ and used a real-valued DG frame. The OST algorithm is used for support recovery. Note that the OST algorithm requires less than a minute to terminate, whereas most recovery algorithms designed for random (or even expander-based) compressed sensing do not converge in a reasonable time.

We also performed a similar experiment in the noisy regime. Here we independently varied the standard deviations of the data-domain noise ($\sigma_D$) and the measurement noise ($\sigma_M$) from $10^{-6}$ to $10^{-1}$. We then used the LASSO program to obtain a sparse approximation $\hat{\alpha}$ to the $k$-sparse vector $\alpha^*$. Figure 12.2 plots the average reconstruction error ($-10\log_{10}(\|\alpha^* - \hat{\alpha}\|_2)$) as a function of $\sigma_M$ and $\sigma_D$.

Figure 12.3 plots the sparse approximation error as a function of the noise in the measurement domain. In the measurement noise study, an iid $\mathcal{N}(0, \sigma_M^2)$ measurement noise vector is added to the sensed vector to obtain the $M$-dimensional vector $f$. The original $k$-sparse signal $\alpha^*$ is then approximated by solving the LASSO program with $\lambda = 2\sqrt{2\log N}\,\sigma_M$. Figure 12.3 and many similar experiments show that real-valued DG frames and random Gaussian matrices of the same size have almost identical performance in terms of noisy signal recovery using the LASSO. However, in contrast to random Gaussian matrices, DG frames do not suffer from storage and computational limitations.
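To make the LASSO program with $\lambda = 2\sqrt{2\log N}\,\sigma_M$ concrete, here is a minimal sketch that solves it by proximal gradient descent (ISTA). The solver choice, the function names, the problem sizes, and the normalized Gaussian matrix standing in for a real-valued DG frame are all assumptions made for illustration; the text does not state which LASSO solver was used in the experiments.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise soft thresholding, the proximal map of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(Phi, f, lam, n_iter=3000):
    """Minimize 0.5 * ||f - Phi a||_2^2 + lam * ||a||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2       # 1 / Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a + step * Phi.T @ (f - Phi @ a), step * lam)
    return a

rng = np.random.default_rng(0)
M, N, k, sigma_M = 64, 128, 3, 0.01
Phi = rng.standard_normal((M, N))
Phi /= np.linalg.norm(Phi, axis=0)                  # unit-norm columns (Gaussian stand-in)
support = rng.choice(N, size=k, replace=False)
alpha_star = np.zeros(N)
alpha_star[support] = 5.0 * rng.choice([-1.0, 1.0], size=k)
f = Phi @ alpha_star + sigma_M * rng.standard_normal(M)

lam = 2 * np.sqrt(2 * np.log(N)) * sigma_M          # regularization weight from the text
alpha_hat = lasso_ista(Phi, f, lam)
recovered = np.flatnonzero(np.abs(alpha_hat) > 0.5)  # large entries of the estimate
print(sorted(recovered), sorted(support))
```

With a DG frame, the two products `Phi @ a` and `Phi.T @ r` inside the loop would be replaced by fast Hadamard-based routines, which is where the storage and computational advantage mentioned above would enter.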

¹ To vary $N$, we selected the first $N$ columns of a DG(9, 0) frame (which is still an incoherent tight-frame as long as $\frac{N}{M}$ is an integer).
Finally, consider a wireless network in which each user is identified by a unique 32-bit MAC address. Let $\Phi$ be a $2^{12} \times 2^{32}$ sensing matrix obtained by selecting the first $2^{32}$ columns of a real-valued Delsarte-Goethals frame based on a DG(11, 1) set. Here each user is assigned to one column of $\Phi$. At transmission time, each active user $i$ simply transmits the $2^{12}$-dimensional column $\varphi_i$. Therefore, the receiver receives a superposition of the columns of all active users. However, at each time period only a few users, $k \ll 2^{32}$, are active. The receiver can exploit this prior sparsity knowledge to recover the active users. However, the number of active users is not known a priori to the receiver, and therefore the receiver must use a model-order agnostic recovery algorithm.

In Figure 12.4 we used the OST algorithm for support recovery of sparse vectors. Here we set $N = 2^{32}$, $M = 2^{12}$, and used the first $2^{32}$ columns of a real-valued DG(11, 1) frame. Note that here the data dimension $N$ is too large for convex optimization methods (or even most greedy methods) to converge in a reasonable time. Even expander-based methods have difficulties in this situation. Nevertheless, the OST algorithm only requires one matrix-vector multiplication using the Fast Hadamard Transform. In our implementation, this matrix-vector multiplication took less than one minute. The running time of the OST algorithm can be significantly reduced by parallelizing the algorithm.
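The "one matrix-vector multiplication" structure can be sketched as follows. This is a toy illustration under stated assumptions: the selection rule used here simply keeps the $k$ largest correlations (in the spirit of a sorted variant), whereas the threshold rule of Chapter 11 is not reproduced; the frame is a small Dirac-Hadamard dictionary with coherence $1/\sqrt{M}$, not the DG(11, 1) frame; and the function names are hypothetical.

```python
import numpy as np

def hadamard(m):
    """Unnormalized 2^m x 2^m Sylvester-Hadamard matrix."""
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

def one_step_support(Phi, f, k):
    """Correlate once with every column and return the k indices of largest magnitude."""
    correlations = Phi.conj().T @ f                 # the single matrix-vector product
    return np.sort(np.argsort(np.abs(correlations))[-k:])

m, k = 6, 3
M = 2 ** m
Phi = np.hstack((np.eye(M), hadamard(m) / np.sqrt(M)))    # toy frame, coherence 1/sqrt(M)
rng = np.random.default_rng(1)
support = np.sort(rng.choice(Phi.shape[1], size=k, replace=False))
alpha_star = np.zeros(Phi.shape[1])
alpha_star[support] = 1.0                                  # all active entries equal to one
f = Phi @ alpha_star                                       # superposition of active columns
estimate = one_step_support(Phi, f, k)
print(estimate, support)
```

In this toy setting recovery is guaranteed by a simple coherence argument: each active column correlates with $f$ at level at least $1 - 2/\sqrt{M} = 0.75$, while each inactive column correlates at level at most $3/\sqrt{M} = 0.375$.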

In this experiment, the sparsity level $k$ was varied from 16 to 32. The sparse vector was generated by selecting a random $k$-subset of the $2^{32}$ columns. To avoid the random-sign effect, the $k$ nonzero entries of $\alpha^*$ all had value one. We repeated each experiment 100 times independently, and recorded the average probability of exact recovery as a function of the sparsity level $k$. As illustrated in Figure 12.4, the algorithm can recover almost all $k$-sparse vectors efficiently as long as $k$ is smaller than 25.
Part V

Model-based  Compressed  Sensing 
Chapter 13


Fast  Model-based  Thresholding 
with  Nesterov’s  Gradient  Method 

13.1  What  is  Model-based  Compressed  Sensing 

In  Section  3.1  we  introduced  the  main  objectives  of  compressed  sensing  as 

•  Designing an efficient $M \times N$ sensing matrix $\Phi$, and

•  Designing  an  efficient  and  robust  reconstruction  algorithm 

We then proposed expander-based and Reed-Muller-based matrices as two families of deterministic and efficient sensing matrices. We also looked at the general problems of sparse approximation and model selection, and introduced the GAME and OST algorithms as examples of two efficient sparse recovery algorithms, with sparse approximation and model-selection guarantees. We further showed that in many practical situations our proposed matrices and recovery algorithms have similar or even better performance compared to the random sensing framework. Moreover, the structures of our deterministic matrices provide several storage and computational advantages.

While such measurement rates and recovery complexities are impressive and have the potential to impact a broad set of compressive sensing applications, sparsity is merely a first-order description of signal structure; in many applications we have considerably more a priori information on the sparse coefficients, which state-of-the-art approaches are only recently beginning to exploit [19]. For instance, in group testing of defective elements among a collection of $N$ items, the defective elements typically cluster across known blocks of items. In data streaming, large elements of $\alpha^*$ often have a minimum distance between each other, and have positive values. In compressive imaging, the sparse coefficients of the signal cluster along the branches of tree structures, such as the wavelet trees of natural images [77, 81].
In this chapter, we will show that by exploiting a priori information on coefficient structure in addition to signal sparsity, we can make the sparse linear sketching approaches to dimensionality reduction more powerful. As an indicator of what can be achieved, recent work [19, 67] leverages such structured sparsity in CS with random dense matrices to reduce, in some cases significantly, the number of measurements $M$ required to stably recover a signal. It does so by permitting only certain configurations of the large and zero/small coefficients via dependencies on the support of the sparse coefficients (the set of indices corresponding to the nonzero entries). During signal recovery, structured sparsity also enables the recovery algorithms to better differentiate true signal information from recovery artifacts, leading to a more robust recovery.

We  will  derive  an  algorithmic  framework  for  structured  sparse  recovery,  which  unifies 
combinatorial  optimization  with  the  non-smooth  convex  optimization  framework  by 
Nesterov  [204,  203].  The  algorithm  proposed  in  this  chapter  can  be  viewed  as  a 
generalization  of  the  OST  algorithm  derived  in  Section  11.2.  In  our  approach,  we 
optimally  use  the  gradient  information  in  the  convex  data  error  objective  to  navigate 
over the non-convex set of structured sparse signals. By optimal, we mean that our algorithms match the known convergence bounds for gradient methods for convex optimization problems [202].

Efficient combinatorial optimization is the key ingredient in this loop to calculate the best projection of a given vector onto the non-convex sparse signal set. Our combinatorial approach with expander-based and Reed-Muller-based matrices achieves the geometric, optimization-based limits for random dense matrices [98] at a fraction of their computational cost.

13.2  Problem  Formulation 

Model-based compressed sensing is the topic of efficient sparse recovery when some extra prior knowledge is available about the sparse vector $\alpha^*$. Throughout this chapter, we identify any extra prior knowledge about the sparse vector $\alpha^*$ by a model $\mathcal{M}$. We also use the notation $\Sigma_{\mathcal{M}(k)}$ to denote the set of all $k$-sparse vectors in that model $\mathcal{M}$.

Let $\Phi$ be an arbitrary $M \times N$ matrix, and let $f$ be an $M$-dimensional vector. In model-based sparse approximation in the $\ell_2$ norm, we focus on the following non-convex optimization problem:

$$\text{minimize } \mathcal{R}(\alpha) \quad \text{s.t. } \alpha \in \Sigma_{\mathcal{M}(k)}, \tag{13.2.1}$$

where the loss function $\mathcal{R}(\alpha) : \Sigma_{\mathcal{M}(k)} \to \mathbb{R}_+$ is defined as the squared $\ell_2$ loss

$$\mathcal{R}(\alpha) = \|f - \Phi\alpha\|_2^2. \tag{13.2.2}$$
Here $\Sigma_{\mathcal{M}(k)}$ is any $k$-dimensional restricted union-of-subspace (RUS) model with a tractable approximation algorithm $\mathbb{M}_k$ that can calculate the projection of any $v \in \mathbb{R}^N$ into $\Sigma_{\mathcal{M}(k)}$:

$$\mathbb{M}_k(v) = \arg\min_{\alpha \in \Sigma_{\mathcal{M}(k)}} \|v - \alpha\|_2. \tag{13.2.3}$$

Tractable RUS models include but are not limited to (i) $k$-sparse signals ($\Sigma_k$), (ii) $(k, b)$-sparse signals, where the $k$-sparse coefficients live in at most $b$ unknown contiguous blocks on a chain graph, (iii) $k$-tree sparse signals, where the $k$-sparse coefficients lie on a rooted connected subtree of an $N$-dimensional tree, and (iv) $(k, \Delta)$-sparse signals, where the $k$-sparse coefficients are separated by at least $\Delta$ zeros on a chain graph. For $\Sigma_k$, $\mathbb{M}_k$ is simple hard thresholding based on sorting the signal coefficients in terms of decreasing magnitude and keeping the largest $k$ while setting the others to zero. For other models, efficient combinatorial and mixed integer model approximation algorithms exist [20].
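For the plain $k$-sparse model $\Sigma_k$, the projection $\mathbb{M}_k$ described above can be sketched in a few lines. The function name is hypothetical, and the sketch uses a partial selection (`numpy.argpartition`) instead of a full sort, which returns the same result whenever the $k$ largest magnitudes are unambiguous; the structured models (ii) to (iv) need the dedicated algorithms cited in the text and are not covered here.

```python
import numpy as np

def hard_threshold(v, k):
    """Best k-sparse approximation of v: keep the k largest-magnitude entries."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argpartition(np.abs(v), -k)[-k:]   # indices of the k largest magnitudes
        out[keep] = v[keep]
    return out

print(hard_threshold([3.0, -1.0, 0.5, -4.0], 2))
```

Signs and values of the retained entries are left untouched; only the smaller entries are zeroed, which is what makes the output the closest point of $\Sigma_k$ in the $\ell_2$ norm.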

In the rest of this chapter we propose an efficient algorithm for solving Equation (13.2.1). In Section 13.3, we set up key properties of the objective function that the later sections build upon. Section 13.4 presents the proposed NIHT algorithm, and Section 13.5 illustrates the compressive sensing performance of NIHT and compares it with $\ell_1$-minimization algorithms.


13.3 Bregman Proxies for Model-Based Sparse Approximation

The model-based sparse approximation problem is not only ill-posed (since the matrix $\Phi$ has a nontrivial kernel), but is also known to be NP-hard [197]. In other words, there is little hope of finding an exact solution of Equation (13.2.1) efficiently in general. However, in Chapter 6.1 we proposed efficient algorithms for approximately solving the problem of sparse approximation in the $\ell_q$ norm. Here we use similar ideas and show that even though model-based sparse approximation is NP-hard, it is still possible to propose efficient algorithms for approximately solving it.

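Similar to Chapter 6.1, we start by defining a proper Bregman function (see Section 7.1 for definitions). We use the loss function $\mathcal{R}(\alpha) = \|f - \Phi\alpha\|_2^2$ as the Bregman function. The following lemma relates the Bregman distance between any two points $\alpha, \alpha' \in \Sigma_{\mathcal{M}(k)}$ to their Euclidean distance in the measurement domain.

Lemma 13.1. Let $\Phi$ be an $M \times N$ matrix, and $f$ be an $M$-dimensional vector. Define the Bregman function $\mathcal{R}(\alpha) : \Sigma_{\mathcal{M}(k)} \to \mathbb{R}_+$ as

$$\mathcal{R}(\alpha) = \|f - \Phi\alpha\|_2^2.$$

Then the Bregman distance between any two model-sparse vectors $\alpha, \alpha' \in \Sigma_{\mathcal{M}(k)}$ is

$$B_{\mathcal{R}}(\alpha, \alpha') = \|\Phi(\alpha - \alpha')\|_2^2.$$

Proof. The proof of Lemma 13.1 relies on straightforward algebraic manipulation. From the definition of the Bregman distance (Definition 7.1) we have

$$\begin{aligned}
B_{\mathcal{R}}(\alpha, \alpha') &= \mathcal{R}(\alpha) - \mathcal{R}(\alpha') - \langle \alpha - \alpha', \nabla\mathcal{R}(\alpha') \rangle \\
&= \|f - \Phi\alpha\|_2^2 - \|f - \Phi\alpha'\|_2^2 + 2\langle \alpha - \alpha', \Phi^T(f - \Phi\alpha') \rangle \\
&= \|\Phi\alpha\|_2^2 - \|\Phi\alpha'\|_2^2 - 2\langle f, \Phi(\alpha - \alpha') \rangle + 2\langle f - \Phi\alpha', \Phi(\alpha - \alpha') \rangle \\
&= \langle \Phi(\alpha + \alpha'), \Phi(\alpha - \alpha') \rangle - 2\langle \Phi\alpha', \Phi(\alpha - \alpha') \rangle = \|\Phi(\alpha - \alpha')\|_2^2.
\end{aligned}$$

□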
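The identity of Lemma 13.1 is easy to check numerically. The following sketch evaluates the Bregman distance directly from its definition, using the gradient $\nabla\mathcal{R}(\alpha') = -2\Phi^T(f - \Phi\alpha')$, and compares it with $\|\Phi(\alpha - \alpha')\|_2^2$; the function names and the random real-valued test data are illustrative assumptions.

```python
import numpy as np

def loss(Phi, f, a):
    """Squared l2 loss R(a) = ||f - Phi a||_2^2."""
    r = f - Phi @ a
    return r @ r

def loss_grad(Phi, f, a):
    """Gradient of R at a, namely -2 Phi^T (f - Phi a)."""
    return -2.0 * Phi.T @ (f - Phi @ a)

def bregman_distance(Phi, f, a, a_prime):
    """B_R(a, a') = R(a) - R(a') - <a - a', grad R(a')>."""
    return loss(Phi, f, a) - loss(Phi, f, a_prime) - (a - a_prime) @ loss_grad(Phi, f, a_prime)

rng = np.random.default_rng(0)
Phi = rng.standard_normal((10, 30))
f = rng.standard_normal(10)
a, a_prime = rng.standard_normal(30), rng.standard_normal(30)
lhs = bregman_distance(Phi, f, a, a_prime)
rhs = np.linalg.norm(Phi @ (a - a_prime)) ** 2
print(lhs, rhs)
```

Note that the identity holds for arbitrary vectors, not only model-sparse ones, and that $f$ drops out of the distance entirely.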

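Throughout the rest of this chapter let $L_{2k}$ denote the restricted Lipschitz constant of $\Phi$:

$$\|\Phi(\alpha - \alpha')\|_2^2 \le L_{2k}\|\alpha - \alpha'\|_2^2,$$

for every $\alpha, \alpha' \in \Sigma_{\mathcal{M}(k)}$. Observe that from Lemma 13.1 we have $B_{\mathcal{R}}(\alpha, \alpha') \le L_{2k}\|\alpha - \alpha'\|_2^2$; therefore if we define

$$U(\alpha, \alpha') = \mathcal{R}(\alpha') + \langle \alpha - \alpha', \nabla\mathcal{R}(\alpha') \rangle + L_{2k}\|\alpha - \alpha'\|_2^2,$$

then $\mathcal{R}(\alpha) = \mathcal{R}(\alpha') + \langle \alpha - \alpha', \nabla\mathcal{R}(\alpha') \rangle + B_{\mathcal{R}}(\alpha, \alpha') \le U(\alpha, \alpha')$, and hence for every $\alpha' \in \Sigma_{\mathcal{M}(k)}$ we have

$$\min_{\alpha \in \Sigma_{\mathcal{M}(k)}} \mathcal{R}(\alpha) \le \min_{\alpha \in \Sigma_{\mathcal{M}(k)}} U(\alpha, \alpha').$$

In the next section we propose an iterative approximation algorithm that iteratively selects a proper value $\alpha^n$ through the Nesterov scheme [204], and then solves the corresponding optimization problem

$$\text{minimize}_{\alpha \in \Sigma_{\mathcal{M}(k)}}\; U(\alpha, \alpha^n),$$

as a proxy for minimizing the loss function $\mathcal{R}(\alpha)$.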
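It may help to see why the proxy problem is tractable. Taking the upper bound with a positive quadratic term, $U(\alpha, \alpha') = \mathcal{R}(\alpha') + \langle \alpha - \alpha', \nabla\mathcal{R}(\alpha') \rangle + L_{2k}\|\alpha - \alpha'\|_2^2$, and completing the square gives the following short derivation, which uses only $\nabla\mathcal{R}(\alpha') = -2\Phi^T(f - \Phi\alpha')$ and the definition (13.2.3) of the model projection:

$$\begin{aligned}
U(\alpha, \alpha')
  &= L_{2k}\,\Big\| \alpha - \Big( \alpha' - \tfrac{1}{2L_{2k}}\nabla\mathcal{R}(\alpha') \Big) \Big\|_2^2
     + \mathcal{R}(\alpha') - \tfrac{1}{4L_{2k}}\|\nabla\mathcal{R}(\alpha')\|_2^2, \\
\arg\min_{\alpha \in \Sigma_{\mathcal{M}(k)}} U(\alpha, \alpha')
  &= \mathbb{M}_k\Big( \alpha' - \tfrac{1}{2L_{2k}}\nabla\mathcal{R}(\alpha') \Big)
   = \mathbb{M}_k\Big( \alpha' + \tfrac{1}{L_{2k}}\Phi^T(f - \Phi\alpha') \Big).
\end{aligned}$$

The last two terms of the first line do not depend on $\alpha$, so minimizing the proxy over the model amounts to one gradient step followed by one model projection.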


13.4  Algebraic  pursuits  and  the  NIHT  algorithm 

In [68], Cevher proposes two algorithms, called Algebraic Pursuits (ALPS), that fuse Nesterov's optimal gradient methods with combinatorial model-based projection algorithms for sparse approximation. For instance, the fast Lipschitz iterative hard thresholding (FLIHT) scheme of ALPS has the following recursion (with $a_{t+1} = 0.5\big(1 + \sqrt{1 + 4a_t^2}\big)$, $a_1 = 1$, and $\theta_t = \frac{a_t - 1}{a_{t+1}}$):

$$\alpha^{t+1} = \mathbb{M}_k\Big( y^t - \frac{1}{L_{3k}}\nabla\mathcal{R}(y^t) \Big), \qquad y^{t+1} = \alpha^t + \theta_t(\alpha^t - \alpha^{t-1}). \tag{13.4.1}$$

Here we propose a third algebraic pursuit algorithm, called NIHT for Nesterov Iterative Hard Thresholding. The algorithm is based upon Nesterov's proximal gradient method [204]. The pseudocode of the NIHT algorithm is summarized as Algorithm 10.
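As a rough illustration of how momentum and hard thresholding combine in schemes of this family, here is a minimal sketch of an accelerated iterative hard thresholding loop. It is not Algorithm 10 and should not be read as the exact FLIHT recursion: it assumes a real-valued $\Phi$, the plain $k$-sparse model, the global constant $2\|\Phi\|_2^2$ as a conservative stand-in for a restricted Lipschitz constant, and the common FISTA-style indexing in which the extrapolation uses the two most recent iterates. All function names are hypothetical.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out[keep] = v[keep]
    return out

def accelerated_iht(Phi, f, k, n_iter=200):
    """Hard-thresholded gradient steps on ||f - Phi a||_2^2 with FISTA-style momentum."""
    lipschitz = 2.0 * np.linalg.norm(Phi, 2) ** 2    # global gradient Lipschitz constant
    alpha_prev = np.zeros(Phi.shape[1])
    y = alpha_prev.copy()
    a = 1.0
    for _ in range(n_iter):
        grad = -2.0 * Phi.T @ (f - Phi @ y)           # gradient of the loss at y
        alpha = hard_threshold(y - grad / lipschitz, k)
        a_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * a * a))
        y = alpha + ((a - 1.0) / a_next) * (alpha - alpha_prev)   # momentum extrapolation
        alpha_prev, a = alpha, a_next
    return alpha_prev

rng = np.random.default_rng(0)
Phi = rng.standard_normal((80, 128)) / np.sqrt(80)    # Gaussian stand-in sensing matrix
alpha_star = np.zeros(128)
alpha_star[rng.choice(128, size=3, replace=False)] = [2.0, -3.0, 4.0]
alpha_hat = accelerated_iht(Phi, Phi @ alpha_star, k=3)
print(np.count_nonzero(alpha_hat), np.linalg.norm(alpha_hat - alpha_star))
```

Each iteration costs one multiplication by $\Phi$, one by $\Phi^T$, and one projection, so with a structured sensing matrix the per-iteration cost is dominated by the fast transform.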
Algorithm 10 NIHT Algorithm for Model-Based Sparse Approximation in $\ell_2$-norm.
Inputs: $M$-dimensional vector $f$, $M \times N$ matrix $\Phi$, and number of iterations $T$
Output: $N$-dimensional vector $\hat{\alpha}$

0.  Set $\alpha^0 = 0_N$ and $x^0 = 0_N$.
for $t = 1, \ldots, T$ do

1.  Set  x *  =  aF-1  +  2^1)#T(/  — 

2.  Set  yl  =  cU_1  +  —  thcU-1). 

3.  Set  a t  =  Mk  ( rtx 1  +  (1  -  r^y1) .  where  rt  =  ^ . 
end  for 

6.  Output $\hat{\alpha} = \alpha^T$.


The proposed NIHT algorithm is a first-order gradient projection algorithm. By first-order, we mean that at each iteration, the algorithm only requires one gradient calculation and a $k$-sparse model approximation. If an expander-based or a Reed-Muller-based sensing matrix is used, the matrix-vector multiplication can be calculated efficiently in time $O(N \log N)$. Therefore, in contrast to convex-programming algorithms, NIHT is scalable and can handle much larger data dimensions.

Although the exact behavior of the NIHT algorithm is known in some special
cases [68], deriving sharp estimation guarantees for its performance is still an
interesting open problem. In the next section we nevertheless provide several
practical comparisons between the NIHT algorithm and $\ell_1$-minimization methods
in compressive sensing that demonstrate its advantages.

13.5  Experimental  Results 

13.5.1  Phase  Transition 

Donoho and Tanner's combinatorial-geometry-based theory precisely quantifies the
fundamental $\ell_1$ sparsity and compression trade-off ($k$ vs. $M$) that NIHT is competing
with. The theory predicts the exact location in the sparsity-undersampling domain
where state-of-the-art algorithms exhibit phase transitions in their performance. The
theory states that CS algorithms should be able to recover $k$-sparse signals from
$M > 2k \log(N/M)$ measurements; this threshold appears quite sharp for Gaussian,
partial Fourier, and expander-based measurement matrix ensembles [98, 29].

Figure 13.1: Phase transition curves of FLIHT (top row) and NIHT (bottom row)
compared to the Donoho-Tanner bound (dashed). Corresponding failure percentages are
shown.

To see how NIHT compares to the theoretical $\ell_1$ phase transitions, we performed
Monte Carlo simulations amounting to a month of CPU time. We fixed the signal
dimension to $N = 1000$ and swept across $k$ and $M$ values (120 and 200 sample points,
respectively). For each $(k, M)$ pair, we repeated the following 100 times: (i) generated
a random sparse vector with unit norm, (ii) generated compressive measurements (no
noise) using Gaussian, Fourier, and expander graph sampling matrices (incomplete),
and (iii) recovered the signals using Basis Pursuit (BP), FLIHT, and NIHT. Both the
FLIHT and NIHT algorithms use the same number of iterations, 1000. We then
reported the number of recoveries that obtain this accuracy or better.
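One cell of such a Monte Carlo sweep can be sketched as follows for the Gaussian ensemble. This is a simplified illustration: the success tolerance `tol`, the Gaussian nonzero values, the $1/\sqrt{M}$ column scaling, and the callable `recover(f, Phi, k)` are assumptions introduced here, and the function names are hypothetical.

```python
import numpy as np

def random_sparse_unit_vector(N, k, rng):
    """k-sparse vector (k >= 1) with random support and unit l2 norm."""
    x = np.zeros(N)
    support = rng.choice(N, size=k, replace=False)
    x[support] = rng.standard_normal(k)
    return x / np.linalg.norm(x)

def success_rate(N, M, k, recover, trials=100, tol=1e-2, seed=0):
    """Fraction of noiseless trials in which recover() reaches error <= tol."""
    rng = np.random.default_rng(seed)
    successes = 0
    for _ in range(trials):
        alpha_true = random_sparse_unit_vector(N, k, rng)
        Phi = rng.standard_normal((M, N)) / np.sqrt(M)  # Gaussian sensing matrix
        f = Phi @ alpha_true                            # noiseless measurements
        alpha_hat = recover(f, Phi, k)
        if np.linalg.norm(alpha_hat - alpha_true) <= tol:
            successes += 1
    return successes / trials
```

Sweeping `success_rate` over a grid of $(k, M)$ values with a fixed recovery routine yields an empirical phase transition diagram of the kind reported in Figure 13.1.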

Figure 13.1 summarizes the results for Gaussian and partial Fourier matrices. The
results are quite promising for NIHT. For comparison, we also provide the $\ell_1$-magic
basis pursuit results (the interior point method where the Newton system is solved
with conjugate gradients) [52], which match the Donoho-Tanner phase transition
curve (cf. the NIHT/Gaussian panel). Compared to $\ell_1$-magic, NIHT increases the
number of sparse coefficients that can be recovered from the same measurements
by approximately 25%. FLIHT performs on par with $\ell_1$-magic. Both the FLIHT and
NIHT algorithms achieve this performance at a fraction of $\ell_1$-magic's computational
cost. Figure 13.2 shows the results of the same experiment with 8-regular expander
graphs.
Figure 13.2: Phase transition curves of the expander-based FLIHT (a) and NIHT (b)
algorithms in expander-based compressed sensing. Here we fixed $N = 1000$ and
$d = 8$. Corresponding failure percentages are shown.


13.5.2  Empirical  Noise  Robustness 

Figure 13.3 illustrates the impact of the measurement noise on the sparse approximation
error of the NIHT algorithm and the Basis Pursuit Denoising algorithm of
Theorem 3.7. Here we set $M = 1024$, $N = 3072$, and $k = 100$. The measurement
noise elements are sampled i.i.d. from a $\mathcal{N}(0, \sigma^2)$ Gaussian distribution. In this
experiment we ranged $\sigma$ from $10^{-6}$ to $10^{-1}$. Each experiment is repeated independently 100
times in the following way. We first generated a $1024 \times 3072$ real-valued DG frame
and a random Gaussian matrix of the same size. We then generated a $k$-sparse vector
with random support, random sign, and unit $\ell_2$ norm. The SPGL package is used for
solving the Basis Pursuit Denoising problem [244, 243], and the reconstruction error
is measured as $\|\alpha - \alpha^*\|_2$.
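The signal and noise model of this experiment can be sketched as follows. The sketch assumes that "random sign" means equal-magnitude entries $\pm 1/\sqrt{k}$ on the support and that the noise levels are sampled on a logarithmic grid; both choices, as well as the function names, are illustrative assumptions.

```python
import numpy as np

def signed_sparse_signal(N, k, rng):
    """k-sparse vector with random support, random signs, and unit l2 norm."""
    alpha = np.zeros(N)
    support = rng.choice(N, size=k, replace=False)
    alpha[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return alpha

def noisy_measurements(Phi, alpha, sigma, rng):
    """Measurements f = Phi @ alpha + e with e ~ N(0, sigma^2 I)."""
    return Phi @ alpha + sigma * rng.standard_normal(Phi.shape[0])

# Assumed grid of noise standard deviations between 1e-6 and 1e-1.
sigmas = np.logspace(-6, -1, num=6)
```

For each value in `sigmas`, the measurements are passed to the recovery routine and the error $\|\alpha - \alpha^*\|_2$ is averaged over the independent repetitions.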

Figure 13.3 compares the mean $\ell_2$ approximation error of the four (matrix, algorithm)
pairs as a function of the measurement noise. It turns out that NIHT outperforms
Basis Pursuit Denoising in high SNR regimes, whereas Basis Pursuit Denoising is
the winner when the noise level is high. Moreover, there is almost no difference
between the performance of a real-valued DG frame and a Gaussian matrix of the
same size.
Figure 13.3: Noise tolerance of the NIHT and Basis Pursuit Denoising algorithms
for $k = 300$, $M = 1024$, and $N = 3072$. Note that there is almost
no difference between the performance of a real-valued DG frame and a random
Gaussian matrix of the same size.

13.5.3  Model-based  Recovery 

All-positive  Model 

Figure 13.4 shows the phase transition of NIHT with positive $k$-sparse signals, using
Gaussian, partial Fourier, and expander-based sensing matrices. We set $N = 1000$,
and each experiment was repeated independently 100 times. The $\ell_1$-magic results
are also provided for comparison. At the end of each iteration, the algorithm
maintains only the $k$ largest positive entries of the recovered vector. Observe that the
prior positivity information, i.e., knowing a priori that the $k$-sparse signal has positive
values, significantly increases the performance of the NIHT algorithm.
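The positivity-constrained projection described above (keep only the $k$ largest positive entries) can be sketched in a few lines. The function name `project_positive_k_sparse` is hypothetical, and the sketch assumes $k$ does not exceed the vector length.

```python
import numpy as np

def project_positive_k_sparse(v, k):
    """Keep the k largest strictly positive entries of v; zero everything else."""
    clipped = np.where(v > 0, v, 0.0)   # discard non-positive entries
    out = np.zeros_like(clipped)
    if k > 0:
        idx = np.argpartition(clipped, -k)[-k:]
        out[idx] = clipped[idx]
    return out
```

Replacing the plain hard-thresholding step with this projection is one way to encode the all-positive model inside an iterative thresholding loop.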

Block  Sparsity  Model 

In this experiment, we considered a specific nested RUS model: block sparsity. In a
block-sparse signal, the locations of the sparse coefficients cluster in blocks under a
specific sorting order. Block-sparse signals have been previously studied in several
different applications, including DNA microarrays and magnetoencephalography. An
equivalent problem arises in signal ensembles, such as array signal processing and
MIMO communication [20]. It has been shown that the block-sparse structure enables
signal recovery from a reduced number of CS measurements when the recovery
algorithms exploit this specific structure [20, 19].

(c) Expander.

Figure 13.4: Phase transition curve of the NIHT algorithm with positive sparse signals
compared to the Donoho-Tanner bound (dashed). Note that the prior positivity
knowledge significantly improves the reconstruction accuracy.

(a) Expander: NIHT ($N = 1024$, $k = 256$), probability of exact recovery. (b) Expander (model-based with block sparsity): Block-NIHT ($N = 1024$, $k = 256$), probability of exact recovery.

Figure 13.5: The impact of block-sparsity on the performance of the NIHT algorithm
is significant. By exploiting the block structure in addition to signal sparsity, NIHT
decreases the number of measurements significantly.

The block-sparse model approximation is quite simple: if a sparse coefficient is selected
within a predefined block of size $b$, all the coefficients within the same block
must be selected as well. Hence, block-sparse approximation is, in a way, equivalent to
unstructured sparse approximation: instead of picking the top $k$ coefficients by their
energy, we pick the top $k/b$ blocks by summing up their $\ell_2$-energy. For simplicity, we
consider uniform block sizes of powers of 2 on the signal vector; hence, the signal
sparsity is also restricted to be a power of 2.
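The block-sparse model approximation can be sketched as follows, assuming contiguous, non-overlapping blocks of uniform size $b$ that divides both the signal length and $k$. The function name `project_block_sparse` is hypothetical.

```python
import numpy as np

def project_block_sparse(v, k, b):
    """Keep the k // b contiguous blocks of size b with the largest l2 energy."""
    N = v.size
    if N % b or k % b:
        raise ValueError("b must divide both the signal length and k")
    blocks = v.reshape(N // b, b)
    energy = np.sum(blocks ** 2, axis=1)   # summed squared entries per block
    n_keep = k // b
    out = np.zeros_like(blocks)
    if n_keep > 0:
        keep = np.argpartition(energy, -n_keep)[-n_keep:]
        out[keep] = blocks[keep]
    return out.reshape(N)
```

With $b = 1$ this reduces to ordinary $k$-sparse hard thresholding, which mirrors the equivalence to unstructured sparse approximation noted above.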

Figure  13.5  investigates  the  advantage  of  incorporating  the  block-sparsity  informa¬ 
tion  on  the  probability  of  exact  recovery  for  N  =  1024  and  k  =  256.  We  vary  the 
block  sparsity  level  as  b  =  (2, 4, ... ,  128,  256)  and  also  the  number  of  expander-based 
measurements  from  M  =  256  to  1024.  Figure  13.5(a)  plots  the  probability  of  suc¬ 
cessful  recovery  while  using  the  NIHT  algorithm,  whereas  Figure  13.5(b)  plots  the 
probability  of  success  of  the  NIHT  algorithm  with  the  block-sparsity  projection.  We 
observe  that  the  block-sparsity  model  significantly  reduces  the  minimum  number  of 
measurements  required  for  exact  recovery. 
13.5.4 Compressive Imaging


In our first experiment, we compared the performance of the NIHT and Basis Pursuit
algorithms on recovering the $128 \times 128$ phantom image. Here $N = 128^2$. We used
$M = 0.33 \times N$ expander-based measurements using an expander graph with left
degree $d = 8$. We also set the number of iterations of the NIHT algorithm to 2000
with sparsity level $k = 0.5 \times M$. For the sparsity basis, we chose the db2 Daubechies
wavelet basis, and used the SPGL package [244, 243] for solving the Basis Pursuit
optimization.

Figure 13.6 compares the NIHT algorithm with the Basis Pursuit method. Since
the Lipschitz constant for this problem is not available, we chose a large constant
and ran the algorithm for 2000 iterations. The reconstruction SNR is measured as
$\mathrm{SNR} = -10 \log_{10}\left(\|\alpha^* - \hat{\alpha}\|_2^2 / \|\alpha^*\|_2^2\right)$, where $\alpha^*$ is the wavelet
coefficient vector and $\hat{\alpha}$ is the output of the algorithm. As illustrated in
Figure 13.6, the NIHT algorithm dominates the Basis Pursuit algorithm in terms of
the reconstruction accuracy.
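A reconstruction SNR of this kind can be computed as in the sketch below. It assumes the common definition in which the argument of the logarithm is the ratio of the squared $\ell_2$ error to the squared $\ell_2$ norm of the true coefficient vector. The function name `reconstruction_snr_db` is hypothetical.

```python
import numpy as np

def reconstruction_snr_db(alpha_true, alpha_hat):
    """SNR in dB: -10 * log10(||alpha_true - alpha_hat||^2 / ||alpha_true||^2)."""
    err = np.sum((alpha_true - alpha_hat) ** 2)
    ref = np.sum(alpha_true ** 2)
    return -10.0 * np.log10(err / ref)
```

Under this definition, a relative $\ell_2$ error of 10% corresponds to 20 dB, and each further factor of 10 in accuracy adds another 20 dB.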

We also used a real image of size $1024 \times 1024$ and generated compressive samples
using a scrambled partial Fourier sensing matrix with $M = 0.33 \times N$. For the sparsity
basis, we picked Daubechies db8 wavelets and judiciously chose $k = 0.15 \times N$ for
sparse recovery. To recover the target image, we then ran NIHT for 2000 iterations.
Figure 13.7 illustrates the reconstructed image, as well as the difference between the
reconstructed image and the true image. Here the NIHT algorithm has a recovery
SNR of 19.81, on par with the Basis Pursuit reconstruction SNR.


(a) Original 128 x 128 phantom image. (b) BP (SNR: 12.86 dB). (c) NIHT (SNR: 15.05 dB).

Figure 13.6: Recovery of the $128 \times 128$ phantom image using the NIHT and the Basis
Pursuit algorithms. The reconstruction SNR of the NIHT algorithm is about 2 dB
higher than that of the Basis Pursuit algorithm.
Chapter 14
Conclusion 


In  this  thesis,  we  provided  two  deterministic  alternatives  to  the  classical  random 
sensing  framework.  The  first  alternative  was  constructed  from  the  adjacencies  of 
vertex  expander  graphs,  and  the  second  alternative  was  constructed  from  the  Delsarte- 
Goethals  codes.  It  has  been  known  for  a  long  time  that  both  (explicit)  expander 
graphs,  and  error-correcting  codes,  are  extremely  powerful  pseudo-random  objects. 
This  pseudo-randomness  was  our  first  motivation  for  suggesting  the  deterministic 
sensing  frameworks  of  this  thesis. 

Nevertheless, in contrast to truly random matrices, one can prove that neither the
expander-based nor the Delsarte-Goethals-based sensing matrices satisfy the Restricted
Isometry Property (RIP), whereas almost every result in the classical random
sensing framework relies on the RIP. This makes the use of our suggested
deterministic frameworks unintuitive. Hence, in order to show the strength of our
deterministic sensing frameworks, we introduced verifiable conditions that are satisfied
by our deterministic matrices and are sufficient to guarantee successful sparse
recovery. We also showed that by exploiting the structure of our design matrices, one
can propose efficient reconstruction algorithms with similar, or in some cases even
better, performance than the $\ell_1$ reconstruction methods.

The expander-based compressive sensing framework provides efficient storage and
computation, explicit construction, and resilience against Poisson noise. If the noise
level is not too high, reconstruction is possible via a simple message-passing algorithm
that requires at most $2k$ simple messages. If the noise level is high, stable
reconstruction is possible using an iterative algorithm that is obtained from a
game-theoretic interpretation of the expander-based sparse reconstruction problem.

The  Delsarte-Goethals  based  compressed  sensing  framework  relies  on  the  coherence 
between  the  rows  and  the  columns  of  the  matrix.  In  this  thesis  we  showed  that  these 
coherence  properties  are  sufficient  to  guarantee  successful  model-selection  and  sparse 
approximation in the average-case compressed sensing framework. The Delsarte-Goethals
frames have optimal coherence values, and therefore provide optimal average-case
model selection guarantees. Average-case CS is a reasonable model for applications
such as multi-user detection. Our experiments suggested that in the DG
sensing framework, it is possible to detect 25 active users from the set of all $2^{32}$ (32-bit)
users by taking only $2^{12}$ measurements. Moreover, the reconstruction takes less
than one minute. In contrast, it takes several weeks to recover the same active users
in the random sensing framework.

One interesting future direction is to apply the deterministic compressed sensing
framework to new applications. In applications such as speech or video recognition, or
quantum computing, the sensing matrix is often provided to us by nature. In those
cases the matrix rarely satisfies the Restricted Isometry Property. However, there is a
much higher chance that the matrix satisfies (some version of) the coherence property
required in the deterministic sensing framework. Adapting the sparse reconstruction
algorithms of this thesis to those applications is a challenging, and also extremely
interesting, task for future work.